# Michael L. LittmanBrown University · Department of Computer Science

Michael L. Littman

PhD

## About

297

Publications

83,422

Reads

**How we measure 'reads'**

A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more

30,079

Citations

Introduction

**Skills and Expertise**

Additional affiliations

July 2002 - June 2012

January 2000 - March 2002

## Publications

Publications (297)

Reward is the driving force for reinforcement-learning agents. We here set out to understand the expressivity of Markov reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task": (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial o...

In recent years, researchers have made significant progress in devising reinforcement-learning algorithms for optimizing linear temporal logic (LTL) objectives and LTL-like objectives. Despite these advancements, there are fundamental limitations to how well this problem can be solved. Previous studies have alluded to this fact but have not examine...

One of the most striking features of human cognition is the ability to plan. Two aspects of human planning stand out—its efficiency and flexibility. Efficiency is especially impressive because plans must often be made in complex environments, and yet people successfully plan solutions to many everyday problems despite having limited cognitive resou...

We employ Proximal Iteration for value-function optimization in reinforcement learning. Proximal Iteration is a computationally efficient technique that enables us to bias the optimization procedure towards more desirable solutions. As a concrete application of Proximal Iteration in deep reinforcement learning, we endow the objective function of th...

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over b...

Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement-learning hardness called the bad-policy density. This quantity measures the fraction of the deterministic stat...

Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to various feedback schemes human trainers are lik...

Theory of mind enables an observer to interpret others' behavior in terms of unobservable beliefs, desires, intentions, feelings, and expectations about the world. This also empowers the person whose behavior is being observed: By intelligently modifying her actions, she can influence the mental representations that an observer ascribes to her, and...

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These the...

A core operation in reinforcement learning (RL) is finding an action that is optimal with respect to a learned value function. This operation is often challenging when the learned value function takes continuous actions as input. We introduce deep radial-basis value functions (RBVFs): value functions learned using a deep network with a radial-basis...

In this work, we propose and explore Deep Graph Value Network (DeepGV) as a promising method to work around sample complexity in deep reinforcement-learning agents using a message-passing mechanism. The main idea is that the agent should be guided by structured non-neural-network algorithms like dynamic programming. According to recent advances in...

One of the most striking features of human cognition is the capacity to plan. Two aspects of human planning stand out: its efficiency, even in complex environments, and its flexibility, even in changing environments. Efficiency is especially impressive because directly computing an optimal plan is intractable, even for modestly complex tasks, and y...

Two common approaches for automating IoT smart spaces are having users write rules using trigger-action programming (TAP) or training machine learning models based on observed actions. In this paper, we unite these approaches. We introduce and evaluate Trace2TAP, a novel method for automatically synthesizing TAP rules from traces (time-stamped logs...

Planning is useful. It lets people take actions that have desirable long-term consequences. But, planning is hard. It requires thinking about consequences, which consumes limited computational and cognitive resources. Thus, people should plan their actions, but they should also be smart about how they deploy resources used for planning their action...

Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and "challenging problems" to be continuous control tasks. Our...

A core operation in reinforcement learning (RL) is finding an action that is optimal with respect to a learned state-action value function. This operation is often challenging when the learned value function takes continuous actions as input. We introduce deep RBF value functions: state-action value functions learned using a deep neural network wit...

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These the...

Because machine learning would benefit from reduced data requirements, some prior work has proposed using humans not just to label data, but also to explain those labels. To characterize the evidence humans might want to provide, we conducted a user study and a data experiment. In the user study, 75 participants provided classification labels for 2...

In order for robots and other artificial agents to efficiently learn to perform useful tasks defined by an end user, they must understand not only the goals of those tasks, but also the structure and dynamics of that user's environment. While existing work has looked at how the goals of a task can be inferred from a human teacher, the agent is ofte...

Human social behavior is structured by relationships. We form teams, groups, tribes, and alliances at all scales of human life. These structures guide multi-agent cooperation and competition, but when we observe others these underlying relationships are typically unobservable and hence must be inferred. Humans make these inferences intuitively and...

ion can give rise to models of environments that are both compressed and useful, thereby enabling efficient sequential decision making. In this work, we offer the first formalism and analysis of the trade-off between compression and performance made in the context of state abstraction for Apprenticeship Learning. We build on Rate-Distortion theory,...

Agents that can make better use of computation, experience, time, and memory can solve a greater range of problems more effectively. A crucial ingredient for managing such finite resources is intelligently chosen abstract representations. But, how do abstractions facilitate problem solving under limited resources? What makes an abstraction useful?...

Trigger-action programming (TAP) is a programming model enabling users to connect services and devices by writing if-then rules. As such systems are deployed in increasingly complex scenarios, users must be able to identify programming bugs and reason about how to fix them. We first systematize the temporal paradigms through which TAP systems could...

Carrots and sticks motivate behavior, and people can teach new behaviors to other organisms, such as children or nonhuman animals, by tapping into their reward learning mechanisms. But how people teach with reward and punishment depends on their expectations about the learner. We examine how people teach using reward and punishment by contrasting t...

Theory of mind enables an observer to interpret others’ behavior in terms of unobservable beliefs, desires, intentions, feelings, and expectations about the world. This also empowers the person whose behavior is being observed: By intelligently modifying her actions, she can influence the mental representations that an observer ascribes to her, and...

Human social behavior is structured by relationships. We form teams, groups, tribes, and alliances at all scales of human life. These structures guide multi-agent cooperation and competition, but when we observe others these underlying relationships are typically unobservable and hence must be inferred. Humans make these inferences intuitively and...

Existing work in machine learning has shown that algorithms can benefit from the use of curricula—learning first on simple examples before moving to more difficult problems. This work studies the curriculum-design problem in the context of sequential decision tasks, analyzing how different curricula affect learning in a Sokoban-like domain, and pre...

Natural selection designs some social behaviors to depend on flexible learning processes, whereas others are relatively rigid or reflexive. What determines the balance between these two approaches? We offer a detailed case study in the context of a two-player game with antisocial behavior and retaliatory punishment. We show that each player in this...

This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy....

Background
Computer-delivered interventions have been shown to be effective in reducing alcohol consumption in heavy drinking college students. However, these computer-delivered interventions rely on mouse, keyboard, or touchscreen responses for interactions between the users and the computer-delivered intervention. The principles of motivational i...

We propose a new task-specification language for Markov decision processes that is designed to be an improvement over reward functions by being environment independent. The language is a variant of Linear Temporal Logic (LTL) that is extended to probabilistic specifications in a way that permits approximations to be learned in finite time. We provi...

Humans often attempt to influence one another’s behavior using rewards and punishments. How does this work? Psychologists have often assumed that “evaluative feedback” influences behavior via standard learning mechanisms that learn from environmental contingencies. On this view, teaching with evaluative feedback involves leveraging learning systems...

For agents and robots to become more useful, they must be able to quickly learn from non-technical users. This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions...

While researchers have long investigated end-user programming using a trigger-action (if-then) model, the website IFTTT is among the first instances of this paradigm being used on a large scale. To understand what IFTTT users are creating, we scraped the 224,590 programs shared publicly on IFTTT as of September 2015 and are releasing this dataset t...

Successfully navigating the social world requires reasoning about both high-level strategic goals, such as whether to cooperate or compete, as well as the low-level actions needed to achieve those goals. While previous work in experimental game theory has examined the former and work on multi-agent systems has examined the later, there has been lit...

Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well...

For real-world applications, virtual agents must be able to learn new behaviors from non-technical users. Positive and negative feedback are an intuitive way to train new behaviors, and existing work has presented algorithms for learning from such feedback. That work, however, treats feedback as numeric reward to be maximized, and assumes that all...

We investigate the practicality of letting average users customize smart-home devices using trigger-action ("if, then") programming. We find trigger-action programming can express most desired behaviors submitted by participants in an online study. We identify a class of triggers requiring machine learning that has received little attention. We eva...

Markov decision problems (MDPs) provide the foundations for a number of
problems of interest to AI researchers studying automated planning and
reinforcement learning. In this paper, we summarize results regarding the
complexity of solving MDPs and the running time of MDP solution algorithms. We
argue that, although MDPs can be solved efficiently in...

Most exact algorithms for general partially observable Markov decision
processes (POMDPs) use a form of dynamic programming in which a
piecewise-linear and convex representation of one value function is transformed
into another. We examine variations of the "incremental pruning" method for
solving this problem and compare them to earlier algorithms...

In this work, we introduce graphical modelsfor multi-player game theory, and
give powerful algorithms for computing their Nash equilibria in certain cases.
An n-player game is given by an undirected graph on n nodes and a set of n
local matrices. The interpretation is that the payoff to player i is determined
entirely by the actions of player i and...

This paper considers the problem of learning task specific web-service descriptions from traces of users successfully completing a task. Unlike prior approaches, we take a traditional machine-learning perspective to the construction of web-service models from data. Our representation models both syntactic features of web-service schemas (including...

Two-player complete-information game trees are perhaps the simplest possible
setting for studying general-sum games and the computational problem of finding
equilibria. These games admit a simple bottom-up algorithm for finding subgame
perfect Nash equilibria efficiently. However, such an algorithm can fail to
identify optimal equilibria, such as t...

Model-based learning algorithms have been shown to use experience efficiently
when learning to solve Markov Decision Processes (MDPs) with finite state and
action spaces. However, their high computational cost due to repeatedly solving
an internal model inhibits their use in large-scale problems. We propose a
method based on real-time dynamic progr...

Continuous state spaces and stochastic, switching dynamics characterize a
number of rich, realworld domains, such as robot navigation across varying
terrain. We describe a reinforcementlearning algorithm for learning in these
domains and prove for certain environments the algorithm is probably
approximately correct with a sample complexity that sca...

We present a polynomial-time algorithm that always finds an (approximate)
Nash equilibrium for repeated two-player stochastic games. The algorithm
exploits the folk theorem to derive a strategy profile that forms an
equilibrium by buttressing mutually beneficial behavior with threats, where
possible. One component of our algorithm efficiently searc...

This paper addresses the problem of training an artificial agent to follow verbal instructions representing high-level tasks using a set of instructions paired with demonstration traces of appropriate behavior. From this data, a mapping from instructions to tasks is learned, enabling the agent to carry out new instructions in novel environments.

This article presents a population-based cognitive hierarchy model that can be used to estimate the reasoning depth and sophistication of a collection of opponents' strategies from observed behavior in repeated games. This framework provides a compact representation of a distribution of complicated strategies by reducing them to a small number of p...

This paper presents a new algorithm for online linear regression whose
efficiency guarantees satisfy the requirements of the KWIK (Knows What It
Knows) framework. The algorithm improves on the complexity bounds of the
current state-of-the-art procedure in this setting. We explore several
applications of this algorithm for learning compact reinforce...

We present a modular approach to reinforcement learning that uses a Bayesian
representation of the uncertainty over models. The approach, BOSS (Best of
Sampled Set), drives exploration by sampling multiple models from the posterior
and selecting actions optimistically. It extends previous work by providing a
rule for deciding when to resample and h...

The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish selected posts or excerpts.twitterFollow us on Twitter ...

Bayes-optimal behavior, while well-defined, is often difficult to
achieve. Recent advances in the use of Monte-Carlo tree search (MCTS)
have shown that it is possible to act near-optimally in Markov Decision
Processes (MDPs) with very large or infinite state spaces. Bayes-optimal
behavior in an unknown MDP is equivalent to optimal behavior in the
k...

Finding a meaningful way of characterizing the difficulty of partially observable Markov decision processes (POMDPs) is a core theoretical problem in POMDP research. State-space size is often used as a proxy for POMDP difficulty, but it is a weak metric at best. Existing work has shown that the covering number for the reachable belief space, which...

We show that the problem of finding an optimal stochastic 'blind' controller
in a Markov decision process is an NP-hard problem. The corresponding decision
problem is NP-hard, in PSPACE, and SQRT-SUM-hard, hence placing it in NP would
imply breakthroughs in long-standing open problems in computer science. Our
result establishes that the more genera...

Although household devices and home appliances function more and more as network-connected computers, they don’t provide programming
interfaces for the average user. We first identify the programming primitives and control structures necessary for the universal
programming of devices. We then propose a mapping between the features necessary for the...

The special issue of Machine Learning focused on introducing the empirical evaluations in reinforcement learning. It was demonstrated that reinforcement-learning approaches were evaluated in one or more of three ways, such as subjectively, theoretically, and empirically. Subjective evaluations, in which researchers assessed the significance and pot...

The nodes in a wireless ad hoc network act as routers in a self-configuring network without infrastructure. An application running on the nodes in the ad hoc network may require that intermediate nodes act as routers, receiving and forwarding data packets to other nodes to overcome the limitations of noise, router congestion and limited transmissio...

For the problem of model choice in linear regression, we introduce a Bayesian adaptive sampling algorithm (BAS), that samples models without replacement from the space of models. For problems that permit enumeration of all models, BAS is guaranteed to enumerate the model space in 2p iterations where p is the number of potential variables under cons...

The First Trading Agent Competition (TAC) was held from June 22nd to July
8th, 2000. TAC was designed to create a benchmark problem in the complex domain
of e-marketplaces and to motivate researchers to apply unique approaches to a
common task. This article describes ATTac-2000, the first-place finisher in
TAC. ATTac-2000 uses a principled bidding...

Lexicographic preference models (LPMs) are an intuitive representation that corresponds to many real-world preferences exhibited by human decision makers. Previous algorithms for learning LPMs produce a “best guess” LPM that is consistent with the observations. Our approach is more democratic: we do not commit to a single LPM. Instead, we approxima...

We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning...

A good (a professional soft-serve ice cream maker, as it turns out) is being raffled off by an auctioneer. Bidders can buy tickets in the raffle for $1 apiece and at the end of the raffle the good is assigned to one of the bidders. The probability of assigning the good to a given bidder is proportional to the number of tickets he or she bought. Fiv...

Most Relevant Explanation (MRE) is the problem of finding a partial instantiation of a set of target variables that maximizes the generalized Bayes factor as the explanation for given evidence in a Bayesian network. MRE has a huge solution space and is extremely difficult to solve in large Bayesian networks. In this paper, we first prove that MRE i...

Designing the dialogue policy of a spoken dialogue system involves many nontrivial choices. This paper presents a reinforcement learning approach for automatically optimizing a dialogue policy, which addresses the technical challenges in applying reinforcement learning to a working dialogue system with human users. We report on the design, construc...

The nodes in a wireless ad hoc network act as routers in a self-configuring network without infrastructure. An application running on the nodes in the ad hoc network may require that intermediate nodes act as routers, receiving and forwarding data packets to other nodes to overcome the limitations of noise, router congestion and limited transmissio...

This paper examines how the choice of the selection mechanism in an evolutionary algorithm impacts the objective function it optimizes, specifically when the fitness function is noisy. We provide formal results showing that, in an abstract infinite-population model, proportional selection optimizes expected fitness, truncation selection optimizes o...

In this paper, we present a new algorithm that integrates recent advances in solving continuous bandit problems with sample-based rollout methods for planning in Markov Decision Processes (MDPs). Our algorithm, Hierarchical Optimistic Optimization applied to Trees (HOOT) addresses planning in continuous-action MDPs. Empirical results are given that...