Michael L. Littman
Brown University · Department of Computer Science

PhD

About

297 Publications
83,422 Reads
30,079 Citations
Additional affiliations
July 2002 - June 2012: Rutgers, The State University of New Jersey, Professor (Full)
January 2000 - March 2002: AT&T, Principal Member of Technical Staff

Publications (297)
Conference Paper
Reward is the driving force for reinforcement-learning agents. We here set out to understand the expressivity of Markov reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task": (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial o...
Conference Paper
In recent years, researchers have made significant progress in devising reinforcement-learning algorithms for optimizing linear temporal logic (LTL) objectives and LTL-like objectives. Despite these advancements, there are fundamental limitations to how well this problem can be solved. Previous studies have alluded to this fact but have not examine...
Article
Full-text available
One of the most striking features of human cognition is the ability to plan. Two aspects of human planning stand out—its efficiency and flexibility. Efficiency is especially impressive because plans must often be made in complex environments, and yet people successfully plan solutions to many everyday problems despite having limited cognitive resou...
Preprint
We employ Proximal Iteration for value-function optimization in reinforcement learning. Proximal Iteration is a computationally efficient technique that enables us to bias the optimization procedure towards more desirable solutions. As a concrete application of Proximal Iteration in deep reinforcement learning, we endow the objective function of th...
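For orientation, the generic proximal-point template that such an approach builds on is the update below; the paper's concrete deep-RL objective is truncated above, so only the standard form is shown, with the parameter lambda setting the pull toward the previous iterate:

```latex
w_{k+1} \;=\; \arg\min_{w}\; \mathcal{L}(w) \;+\; \frac{1}{2\lambda}\,\lVert w - w_{k} \rVert_2^2
```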
Preprint
Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over b...
Preprint
Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement-learning hardness called the bad-policy density. This quantity measures the fraction of the deterministic stat...
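A toy rendering of the idea, assuming "bad" means falling more than a margin below the optimal start-state value (the paper's exact definition is truncated above); the MDP, margin, and seed are all illustrative:

```python
# Toy "bad-policy density": the fraction of deterministic stationary
# policies whose start-state value falls below a threshold.
import itertools
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state dist
R = rng.uniform(size=(n_states, n_actions))                       # R[s, a] = reward

def policy_value(pi):
    """Exact policy evaluation via the linear Bellman system."""
    P_pi = np.array([P[s, pi[s]] for s in range(n_states)])
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

values = [policy_value(pi)[0]  # value at start state 0
          for pi in itertools.product(range(n_actions), repeat=n_states)]
threshold = max(values) - 0.1  # illustrative margin defining "bad"
print("bad-policy density:", np.mean([v < threshold for v in values]))
```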
Preprint
Full-text available
Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to various feedback schemes human trainers are lik...
Article
Theory of mind enables an observer to interpret others' behavior in terms of unobservable beliefs, desires, intentions, feelings, and expectations about the world. This also empowers the person whose behavior is being observed: By intelligently modifying her actions, she can influence the mental representations that an observer ascribes to her, and...
Article
We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These the...
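The paper's metric and proof are not reproduced in this snippet; as a sanity check on the general phenomenon, the standard bound that perturbing rewards by at most eps moves optimal values by at most eps / (1 - gamma) can be verified numerically:

```python
# Two "close" MDPs (same transitions, slightly perturbed rewards) have
# close optimal value functions: a Lipschitz-style bound, checked here.
import numpy as np

n_s, n_a, gamma, eps = 4, 2, 0.9, 0.05
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R1 = rng.uniform(size=(n_s, n_a))
R2 = R1 + rng.uniform(-eps, eps, size=(n_s, n_a))  # the nearby MDP

def optimal_values(R, iters=2000):
    V = np.zeros(n_s)
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)  # Bellman optimality backup
    return V

gap = np.abs(optimal_values(R1) - optimal_values(R2)).max()
print(gap, "<=", eps / (1 - gamma))  # the bound holds
```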
Article
A core operation in reinforcement learning (RL) is finding an action that is optimal with respect to a learned value function. This operation is often challenging when the learned value function takes continuous actions as input. We introduce deep radial-basis value functions (RBVFs): value functions learned using a deep network with a radial-basis...
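A schematic of the action-side RBF head in plain NumPy (the deep network that would produce the centers and values for each state is omitted; every number below is made up for illustration):

```python
# Q(s, a) as a normalized radial-basis combination of per-center values,
# so max_a Q(s, a) can be approximated cheaply by a max over the centers.
import numpy as np

def rbf_q(action, centers, center_values, beta=2.0):
    """Normalized Gaussian-RBF interpolation of Q over the action space."""
    sq_dists = ((centers - action) ** 2).sum(axis=1)
    w = np.exp(-beta * sq_dists)
    return (w * center_values).sum() / w.sum()

# Hypothetical state-conditioned outputs (fixed numbers for illustration).
centers = np.array([[-0.5], [0.0], [0.7]])   # action-space centers
center_values = np.array([1.0, 0.2, 1.5])    # value at each center

best_center = centers[center_values.argmax()]      # cheap approximate argmax
print(rbf_q(best_center, centers, center_values))  # roughly max_a Q(s, a)
```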
Article
In this work, we propose and explore Deep Graph Value Network (DeepGV) as a promising method to work around sample complexity in deep reinforcement-learning agents using a message-passing mechanism. The main idea is that the agent should be guided by structured non-neural-network algorithms like dynamic programming. According to recent advances in...
Preprint
One of the most striking features of human cognition is the capacity to plan. Two aspects of human planning stand out: its efficiency, even in complex environments, and its flexibility, even in changing environments. Efficiency is especially impressive because directly computing an optimal plan is intractable, even for modestly complex tasks, and y...
Article
Two common approaches for automating IoT smart spaces are having users write rules using trigger-action programming (TAP) or training machine learning models based on observed actions. In this paper, we unite these approaches. We introduce and evaluate Trace2TAP, a novel method for automatically synthesizing TAP rules from traces (time-stamped logs...
Preprint
Planning is useful. It lets people take actions that have desirable long-term consequences. But, planning is hard. It requires thinking about consequences, which consumes limited computational and cognitive resources. Thus, people should plan their actions, but they should also be smart about how they deploy resources used for planning their action...
Preprint
Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and "challenging problems" to be continuous control tasks. Our...
Preprint
A core operation in reinforcement learning (RL) is finding an action that is optimal with respect to a learned state-action value function. This operation is often challenging when the learned value function takes continuous actions as input. We introduce deep RBF value functions: state-action value functions learned using a deep neural network wit...
Preprint
We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These the...
Chapter
Because machine learning would benefit from reduced data requirements, some prior work has proposed using humans not just to label data, but also to explain those labels. To characterize the evidence humans might want to provide, we conducted a user study and a data experiment. In the user study, 75 participants provided classification labels for 2...
Preprint
In order for robots and other artificial agents to efficiently learn to perform useful tasks defined by an end user, they must understand not only the goals of those tasks, but also the structure and dynamics of that user's environment. While existing work has looked at how the goals of a task can be inferred from a human teacher, the agent is ofte...
Article
Human social behavior is structured by relationships. We form teams, groups, tribes, and alliances at all scales of human life. These structures guide multi-agent cooperation and competition, but when we observe others these underlying relationships are typically unobservable and hence must be inferred. Humans make these inferences intuitively and...
Article
Abstraction can give rise to models of environments that are both compressed and useful, thereby enabling efficient sequential decision making. In this work, we offer the first formalism and analysis of the trade-off between compression and performance made in the context of state abstraction for Apprenticeship Learning. We build on Rate-Distortion theory,...
Preprint
Agents that can make better use of computation, experience, time, and memory can solve a greater range of problems more effectively. A crucial ingredient for managing such finite resources is intelligently chosen abstract representations. But, how do abstractions facilitate problem solving under limited resources? What makes an abstraction useful?...
Conference Paper
Trigger-action programming (TAP) is a programming model enabling users to connect services and devices by writing if-then rules. As such systems are deployed in increasingly complex scenarios, users must be able to identify programming bugs and reason about how to fix them. We first systematize the temporal paradigms through which TAP systems could...
Article
Carrots and sticks motivate behavior, and people can teach new behaviors to other organisms, such as children or nonhuman animals, by tapping into their reward learning mechanisms. But how people teach with reward and punishment depends on their expectations about the learner. We examine how people teach using reward and punishment by contrasting t...
Preprint
Theory of mind enables an observer to interpret others’ behavior in terms of unobservable beliefs, desires, intentions, feelings, and expectations about the world. This also empowers the person whose behavior is being observed: By intelligently modifying her actions, she can influence the mental representations that an observer ascribes to her, and...
Preprint
Human social behavior is structured by relationships. We form teams, groups, tribes, and alliances at all scales of human life. These structures guide multi-agent cooperation and competition, but when we observe others these underlying relationships are typically unobservable and hence must be inferred. Humans make these inferences intuitively and...
Article
Existing work in machine learning has shown that algorithms can benefit from the use of curricula—learning first on simple examples before moving to more difficult problems. This work studies the curriculum-design problem in the context of sequential decision tasks, analyzing how different curricula affect learning in a Sokoban-like domain, and pre...
Article
Natural selection designs some social behaviors to depend on flexible learning processes, whereas others are relatively rigid or reflexive. What determines the balance between these two approaches? We offer a detailed case study in the context of a two-player game with antisocial behavior and retaliatory punishment. We show that each player in this...
Conference Paper
Full-text available
This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy....
Article
Full-text available
Background: Computer-delivered interventions have been shown to be effective in reducing alcohol consumption in heavy drinking college students. However, these computer-delivered interventions rely on mouse, keyboard, or touchscreen responses for interactions between the users and the computer-delivered intervention. The principles of motivational i...
Article
We propose a new task-specification language for Markov decision processes that is designed to be an improvement over reward functions by being environment independent. The language is a variant of Linear Temporal Logic (LTL) that is extended to probabilistic specifications in a way that permits approximations to be learned in finite time. We provi...
Article
Humans often attempt to influence one another’s behavior using rewards and punishments. How does this work? Psychologists have often assumed that “evaluative feedback” influences behavior via standard learning mechanisms that learn from environmental contingencies. On this view, teaching with evaluative feedback involves leveraging learning systems...
Article
For agents and robots to become more useful, they must be able to quickly learn from non-technical users. This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions...
Conference Paper
While researchers have long investigated end-user programming using a trigger-action (if-then) model, the website IFTTT is among the first instances of this paradigm being used on a large scale. To understand what IFTTT users are creating, we scraped the 224,590 programs shared publicly on IFTTT as of September 2015 and are releasing this dataset t...
Conference Paper
Full-text available
Successfully navigating the social world requires reasoning about both high-level strategic goals, such as whether to cooperate or compete, as well as the low-level actions needed to achieve those goals. While previous work in experimental game theory has examined the former and work on multi-agent systems has examined the latter, there has been lit...
Article
Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well...
Article
For real-world applications, virtual agents must be able to learn new behaviors from non-technical users. Positive and negative feedback are an intuitive way to train new behaviors, and existing work has presented algorithms for learning from such feedback. That work, however, treats feedback as numeric reward to be maximized, and assumes that all...
Article
We investigate the practicality of letting average users customize smart-home devices using trigger-action ("if, then") programming. We find trigger-action programming can express most desired behaviors submitted by participants in an online study. We identify a class of triggers requiring machine learning that has received little attention. We eva...
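For readers unfamiliar with the model, a minimal sketch of trigger-action rules as data; the device names and rules below are invented, not taken from the study:

```python
# Each trigger-action rule pairs a trigger predicate over the home
# state with an action to perform when the trigger fires.
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, object]

@dataclass
class Rule:
    trigger: Callable[[State], bool]
    action: str

rules = [
    Rule(lambda s: s["motion"] and s["hour"] >= 20, "turn on porch light"),
    Rule(lambda s: s["temp_f"] > 78, "start air conditioning"),
]

state = {"motion": True, "hour": 21, "temp_f": 72}
for rule in rules:
    if rule.trigger(state):
        print(rule.action)  # fires: "turn on porch light"
```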
Article
Full-text available
Markov decision problems (MDPs) provide the foundations for a number of problems of interest to AI researchers studying automated planning and reinforcement learning. In this paper, we summarize results regarding the complexity of solving MDPs and the running time of MDP solution algorithms. We argue that, although MDPs can be solved efficiently in...
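One route to the efficiency results such surveys cover is the classic linear-programming formulation of MDP planning; a compact sketch on a toy random model, using scipy's stock LP solver:

```python
# MDP planning as a linear program: minimize sum_s V(s) subject to
# V(s) >= R(s, a) + gamma * sum_s' P(s'|s, a) V(s') for every (s, a).
import numpy as np
from scipy.optimize import linprog

n_s, n_a, gamma = 4, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R = rng.uniform(size=(n_s, n_a))

# Rewrite each constraint in linprog's A_ub @ V <= b_ub form:
# gamma * P[s, a] @ V - V(s) <= -R[s, a].
A_ub, b_ub = [], []
for s in range(n_s):
    for a in range(n_a):
        row = gamma * P[s, a]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(n_s), A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * n_s)
print(res.x)  # the optimal state values V*
```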
Article
Full-text available
Most exact algorithms for general partially observable Markov decision processes (POMDPs) use a form of dynamic programming in which a piecewise-linear and convex representation of one value function is transformed into another. We examine variations of the "incremental pruning" method for solving this problem and compare them to earlier algorithms...
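A simplified flavor of the pruning step at the heart of such methods: discarding alpha-vectors that are pointwise dominated (the full algorithm also uses linear programs to remove vectors dominated only over the belief simplex):

```python
# Prune alpha-vectors from a piecewise-linear convex value function.
import numpy as np

def prune_pointwise(alphas):
    """Keep alpha-vectors not pointwise-dominated by another vector."""
    kept = []
    for i, a in enumerate(alphas):
        dominated = any(i != j and np.all(b >= a) and np.any(b > a)
                        for j, b in enumerate(alphas))
        if not dominated:
            kept.append(a)
    return kept

alphas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.4, 0.4])]
# The third vector survives the pointwise check even though an LP over
# beliefs would prune it: max(p, 1 - p) >= 0.5 > 0.4 at every belief.
print(prune_pointwise(alphas))
```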
Article
Full-text available
In this work, we introduce graphical models for multi-player game theory, and give powerful algorithms for computing their Nash equilibria in certain cases. An n-player game is given by an undirected graph on n nodes and a set of n local matrices. The interpretation is that the payoff to player i is determined entirely by the actions of player i and...
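The representation itself is easy to sketch: one local payoff table per player, indexed only by that player and its graph neighbors, which is what keeps it compact (toy path graph and invented payoffs below):

```python
import numpy as np

n_players = 3
neighbors = {0: [1], 1: [0, 2], 2: [1]}  # a path graph 0 - 1 - 2
rng = np.random.default_rng(3)
# One local table per player, one axis per player in (i, *neighbors[i]).
payoff = {i: rng.uniform(size=(2,) * (1 + len(neighbors[i])))
          for i in range(n_players)}

def payoff_of(i, joint):
    """Player i's payoff depends only on i's and its neighbors' actions."""
    return payoff[i][(joint[i], *[joint[j] for j in neighbors[i]])]

def best_response(i, joint):
    return max(range(2), key=lambda a: payoff_of(i, {**joint, i: a}))

joint = {0: 0, 1: 1, 2: 0}
print([best_response(i, joint) for i in range(n_players)])
```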
Article
Full-text available
This paper considers the problem of learning task specific web-service descriptions from traces of users successfully completing a task. Unlike prior approaches, we take a traditional machine-learning perspective to the construction of web-service models from data. Our representation models both syntactic features of web-service schemas (including...
Article
Full-text available
Two-player complete-information game trees are perhaps the simplest possible setting for studying general-sum games and the computational problem of finding equilibria. These games admit a simple bottom-up algorithm for finding subgame perfect Nash equilibria efficiently. However, such an algorithm can fail to identify optimal equilibria, such as t...
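The bottom-up algorithm referred to is ordinary backward induction; a compact sketch that also illustrates the caveat, since the mover's selfish choice at each node can steer the outcome away from better equilibria:

```python
# Backward induction on a two-player, complete-information game tree,
# returning the subgame-perfect payoff pair reached from each node.
def backward_induction(node):
    """node: ('leaf', (payoff0, payoff1)) or ('choice', player, [children])."""
    if node[0] == "leaf":
        return node[1]
    _, player, children = node
    child_payoffs = [backward_induction(c) for c in children]
    return max(child_payoffs, key=lambda p: p[player])  # mover picks selfishly

tree = ("choice", 0, [
    ("leaf", (2, 2)),
    ("choice", 1, [("leaf", (3, 1)), ("leaf", (0, 4))]),
])
print(backward_induction(tree))  # (2, 2): player 0 avoids the risky subtree
```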
Article
Full-text available
Model-based learning algorithms have been shown to use experience efficiently when learning to solve Markov Decision Processes (MDPs) with finite state and action spaces. However, their high computational cost due to repeatedly solving an internal model inhibits their use in large-scale problems. We propose a method based on real-time dynamic progr...
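In the spirit of real-time dynamic programming, a bare-bones loop that backs up only the states the agent actually visits, rather than sweeping the whole model after every update (toy random MDP, not the paper's algorithm):

```python
import numpy as np

n_s, n_a, gamma = 6, 2, 0.95
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R = rng.uniform(size=(n_s, n_a))
V = np.zeros(n_s)

s = 0
for _ in range(10_000):
    q = R[s] + gamma * P[s] @ V      # back up only the current state
    a = int(q.argmax())
    V[s] = q[a]                      # greedy Bellman update in place
    s = rng.choice(n_s, p=P[s, a])   # follow the model to the next state
```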
Article
Full-text available
Continuous state spaces and stochastic, switching dynamics characterize a number of rich, real-world domains, such as robot navigation across varying terrain. We describe a reinforcement-learning algorithm for learning in these domains and prove for certain environments the algorithm is probably approximately correct with a sample complexity that sca...
Article
Full-text available
We present a polynomial-time algorithm that always finds an (approximate) Nash equilibrium for repeated two-player stochastic games. The algorithm exploits the folk theorem to derive a strategy profile that forms an equilibrium by buttressing mutually beneficial behavior with threats, where possible. One component of our algorithm efficiently searc...
Conference Paper
Full-text available
This paper addresses the problem of training an artificial agent to follow verbal instructions representing high-level tasks using a set of instructions paired with demonstration traces of appropriate behavior. From this data, a mapping from instructions to tasks is learned, enabling the agent to carry out new instructions in novel environments.
Conference Paper
Full-text available
This article presents a population-based cognitive hierarchy model that can be used to estimate the reasoning depth and sophistication of a collection of opponents' strategies from observed behavior in repeated games. This framework provides a compact representation of a distribution of complicated strategies by reducing them to a small number of p...
Article
Full-text available
This paper presents a new algorithm for online linear regression whose efficiency guarantees satisfy the requirements of the KWIK (Knows What It Knows) framework. The algorithm improves on the complexity bounds of the current state-of-the-art procedure in this setting. We explore several applications of this algorithm for learning compact reinforce...
Article
Full-text available
We present a modular approach to reinforcement learning that uses a Bayesian representation of the uncertainty over models. The approach, BOSS (Best of Sampled Set), drives exploration by sampling multiple models from the posterior and selecting actions optimistically. It extends previous work by providing a rule for deciding when to resample and h...
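The sampling step at the core of BOSS, in miniature: Dirichlet posteriors over transitions, several sampled models, and optimistic action selection across them (the paper's model-merging construction and resampling rule are not shown; rewards are treated as known here for brevity):

```python
import numpy as np

n_s, n_a, gamma, K = 3, 2, 0.9, 5
rng = np.random.default_rng(5)
counts = np.ones((n_s, n_a, n_s))  # Dirichlet(1) prior plus transition counts
R = rng.uniform(size=(n_s, n_a))   # rewards assumed known, for brevity

def q_values(P, iters=500):
    """Solve a sampled model approximately by value iteration."""
    V = np.zeros(n_s)
    for _ in range(iters):
        V = (R + gamma * np.einsum("sat,t->sa", P, V)).max(axis=1)
    return R + gamma * np.einsum("sat,t->sa", P, V)

def sample_model():
    """Draw one transition model from the Dirichlet posterior."""
    P = np.empty((n_s, n_a, n_s))
    for s in range(n_s):
        for a in range(n_a):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

Q_best = np.max([q_values(sample_model()) for _ in range(K)], axis=0)  # optimism
policy = Q_best.argmax(axis=1)  # act greedily w.r.t. the best sampled Q
```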
Article
The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish selected posts or excerpts...
Article
Full-text available
Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the k...
Article
Full-text available
Finding a meaningful way of characterizing the difficulty of partially observable Markov decision processes (POMDPs) is a core theoretical problem in POMDP research. State-space size is often used as a proxy for POMDP difficulty, but it is a weak metric at best. Existing work has shown that the covering number for the reachable belief space, which...
Article
Full-text available
We show that the problem of finding an optimal stochastic 'blind' controller in a Markov decision process is an NP-hard problem. The corresponding decision problem is NP-hard, in PSPACE, and SQRT-SUM-hard, hence placing it in NP would imply breakthroughs in long-standing open problems in computer science. Our result establishes that the more genera...
Conference Paper
Full-text available
Although household devices and home appliances function more and more as network-connected computers, they don’t provide programming interfaces for the average user. We first identify the programming primitives and control structures necessary for the universal programming of devices. We then propose a mapping between the features necessary for the...
Article
Full-text available
The special issue of Machine Learning focused on empirical evaluations in reinforcement learning. It noted that reinforcement-learning approaches are evaluated in one or more of three ways: subjectively, theoretically, or empirically. Subjective evaluations, in which researchers assessed the significance and pot...
Article
The nodes in a wireless ad hoc network act as routers in a self-configuring network without infrastructure. An application running on the nodes in the ad hoc network may require that intermediate nodes act as routers, receiving and forwarding data packets to other nodes to overcome the limitations of noise, router congestion and limited transmissio...
Article
Full-text available
For the problem of model choice in linear regression, we introduce a Bayesian adaptive sampling algorithm (BAS) that samples models without replacement from the space of models. For problems that permit enumeration of all models, BAS is guaranteed to enumerate the model space in 2^p iterations, where p is the number of potential variables under cons...
Article
Full-text available
The First Trading Agent Competition (TAC) was held from June 22nd to July 8th, 2000. TAC was designed to create a benchmark problem in the complex domain of e-marketplaces and to motivate researchers to apply unique approaches to a common task. This article describes ATTac-2000, the first-place finisher in TAC. ATTac-2000 uses a principled bidding...
Article
Lexicographic preference models (LPMs) are an intuitive representation that corresponds to many real-world preferences exhibited by human decision makers. Previous algorithms for learning LPMs produce a “best guess” LPM that is consistent with the observations. Our approach is more democratic: we do not commit to a single LPM. Instead, we approxima...
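The representation is simple to state in code: attributes are checked in importance order, and the first attribute on which two items differ decides the preference (attribute names and preferred values below are invented):

```python
def lpm_prefers(x, y, attribute_order, preferred_value):
    """Return True if x is preferred to y under the given LPM."""
    for attr in attribute_order:
        if x[attr] != y[attr]:
            return x[attr] == preferred_value[attr]
    return False  # indifferent: the items agree on every attribute

order = ["safety", "price", "color"]
preferred = {"safety": "high", "price": "low", "color": "red"}
car_a = {"safety": "high", "price": "high", "color": "blue"}
car_b = {"safety": "high", "price": "low", "color": "red"}
print(lpm_prefers(car_a, car_b, order, preferred))  # False: b wins on price
```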
Article
Full-text available
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning...
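The protocol's shape, stripped of the learning theory: on each input the learner must either predict accurately or admit "I don't know" (None below) and receive the label; it may never answer wrongly. A deliberately trivial memorizing learner makes the interface concrete:

```python
class KWIKMemorizer:
    """Skeletal KWIK learner over a finite input set, for shape only."""
    def __init__(self):
        self.table = {}

    def predict(self, x):
        return self.table.get(x)  # None plays the role of "I don't know"

    def observe(self, x, y):
        self.table[x] = y

learner = KWIKMemorizer()
for x, y in [("a", 1), ("b", 0), ("a", 1)]:
    if learner.predict(x) is None:  # only unknown inputs cost a label
        learner.observe(x, y)
```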
Article
A good (a professional soft-serve ice cream maker, as it turns out) is being raffled off by an auctioneer. Bidders can buy tickets in the raffle for $1 apiece and at the end of the raffle the good is assigned to one of the bidders. The probability of assigning the good to a given bidder is proportional to the number of tickets he or she bought. Fiv...
Article
Most Relevant Explanation (MRE) is the problem of finding a partial instantiation of a set of target variables that maximizes the generalized Bayes factor as the explanation for given evidence in a Bayesian network. MRE has a huge solution space and is extremely difficult to solve in large Bayesian networks. In this paper, we first prove that MRE i...
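For context, the objective usually maximized in MRE, stated here from general background since the abstract is truncated, is the generalized Bayes factor of a partial instantiation x of the target variables given evidence e, where x-bar covers the instantiations inconsistent with x:

```latex
\mathrm{GBF}(x;\, e) \;=\; \frac{P(e \mid x)}{P(e \mid \bar{x})}
```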
Article
Full-text available
Designing the dialogue policy of a spoken dialogue system involves many nontrivial choices. This paper presents a reinforcement learning approach for automatically optimizing a dialogue policy, which addresses the technical challenges in applying reinforcement learning to a working dialogue system with human users. We report on the design, construc...
Article
Full-text available
The nodes in a wireless ad hoc network act as routers in a self-configuring network without infrastructure. An application running on the nodes in the ad hoc network may require that intermediate nodes act as routers, receiving and forwarding data packets to other nodes to overcome the limitations of noise, router congestion and limited transmissio...
Conference Paper
Full-text available
This paper examines how the choice of the selection mechanism in an evolutionary algorithm impacts the objective function it optimizes, specifically when the fitness function is noisy. We provide formal results showing that, in an abstract infinite-population model, proportional selection optimizes expected fitness, truncation selection optimizes o...
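The two selection operators being contrasted, side by side in a toy NumPy setting (population, fitness function, and noise below are invented):

```python
import numpy as np

rng = np.random.default_rng(6)

def proportional_selection(pop, fitness, n_parents):
    p = fitness / fitness.sum()                  # fitness-proportional draw
    return pop[rng.choice(len(pop), size=n_parents, p=p)]

def truncation_selection(pop, fitness, n_parents, top_frac=0.2):
    cutoff = np.quantile(fitness, 1 - top_frac)  # keep only the top fraction
    survivors = pop[fitness >= cutoff]
    return survivors[rng.choice(len(survivors), size=n_parents)]

pop = rng.normal(size=(100, 3))                  # 100 genomes, 3 genes each
fitness = 10 - np.abs(pop.sum(axis=1)) + rng.normal(scale=0.1, size=100)
parents = proportional_selection(pop, fitness, 50)
```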
Conference Paper
Full-text available
In this paper, we present a new algorithm that integrates recent advances in solving continuous bandit problems with sample-based rollout methods for planning in Markov Decision Processes (MDPs). Our algorithm, Hierarchical Optimistic Optimization applied to Trees (HOOT) addresses planning in continuous-action MDPs. Empirical results are given that...