Pieter Abbeel
University of California, Berkeley | UCB · Department of Electrical Engineering and Computer Sciences

About

504 Publications
110,948 Reads
70,077 Citations

Publications

Managing extreme AI risks amid rapid progress

Article

May 2024

Preparation requires technical research and development, as well as adaptive, proactive governance

Self-Supervised Instance Segmentation by Grasping

Conference Paper

Oct 2023

Convolutional Occupancy Models for Dense Packing of Complex, Novel Objects

Conference Paper

Oct 2023

Language-Conditioned Path Planning

Preprint

Full-text available

Aug 2023

Contact is at the core of robotic manipulation. At times, it is desired (e.g. manipulation and grasping), and at times, it is harmful (e.g. when avoiding obstacles). However, traditional path planning algorithms focus solely on collision-free paths, limiting their applicability in contact-rich tasks. To address this limitation, we propose the domai...

Language Reward Modulation for Pretraining Reinforcement Learning

Preprint

Full-text available

Aug 2023

Using learned reward functions (LRFs) as a means to solve sparse-reward reinforcement learning (RL) tasks has yielded some steady progress in task-complexity through the years. In this work, we question whether today's LRFs are best-suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretrai...

Convolutional Occupancy Models for Dense Packing of Complex, Novel Objects

Preprint

Full-text available

Jul 2023

Dense packing in pick-and-place systems is an important feature in many warehouse and logistics applications. Prior work in this space has largely focused on planning algorithms in simulation, but real-world packing performance is often bottlenecked by the difficulty of perceiving 3D object geometry in highly occluded, partially observed scenes. In...

Learning to Model the World with Language

Preprint

Jul 2023

To interact with humans in the world, agents need to understand the diverse types of language that people use, relate them to the visual world, and act based on them. While current agents learn to execute simple language instructions from task rewards, we aim to build agents that leverage diverse language that conveys general knowledge, describes t...

Robust and Versatile Bipedal Jumping Control through Reinforcement Learning

Conference Paper

Jul 2023

SpawnNet: Learning Generalizable Visuomotor Skills from Pre-trained Networks

Preprint

Jul 2023

The existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, bringing the potential of learning policies that have broad generalization. Prior works have explored visual pre-training with different self-supervised objectives, but the generalization capabilities of the learned policies remain relatively unkn...

Improving Long-Horizon Imitation through Instruction Prediction

Article

Jun 2023

Complex, long-horizon planning and its combinatorial nature pose steep challenges for learning-based agents. Difficulties in such settings are exacerbated in low data regimes where over-fitting stifles generalization and compounding errors hurt accuracy. In this work, we explore the use of an often unused source of auxiliary supervision: language....

Improving Long-Horizon Imitation Through Instruction Prediction

Preprint

Jun 2023

ALP: Action-Aware Embodied Learning for Perception

Preprint

Jun 2023

Current methods in training and benchmarking vision models exhibit an over-reliance on passive, curated datasets. Although models trained on these datasets have shown strong performance in a wide variety of tasks such as classification, detection, and segmentation, they fundamentally are unable to generalize to an ever-evolving world due to constan...

Probabilistic Adaptation of Text-to-Video Models

Preprint

Full-text available

Jun 2023

Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, adapting these models to tasks with limited domain-specific data, such as animation or robotics videos, poses a significant computational challenge, since finetuning a...

Train Offline, Test Online: A Real Robot Learning Benchmark

Preprint

Jun 2023

Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internet-scale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access...

VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models

Conference Paper

Jun 2023

Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration

Preprint

May 2023

A promising technique for exploration is to maximize the entropy of visited state distribution, i.e., state entropy, by encouraging uniform coverage of visited state space. While it has been effective for an unsupervised setup, it tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states to exploi...

StereoPose: Category-Level 6D Transparent Object Pose Estimation from Stereo Images via Back-View NOCS

Conference Paper

May 2023

Distributional Instance Segmentation: Modeling Uncertainty and High Confidence Predictions with Latent-MaskRCNN

Conference Paper

May 2023

Train Offline, Test Online: A Real Robot Learning Benchmark

Conference Paper

May 2023

DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

Preprint

May 2023

Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Even though relatively simple approaches (e.g., rejection sampling based on reward scores) have been investigat...

Video Prediction Models as Rewards for Reinforcement Learning

Preprint

Full-text available

May 2023

Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video predi...

Masked Trajectory Models for Prediction, Representation, and Control

Preprint

Full-text available

May 2023

We introduce Masked Trajectory Models (MTM) as a generic abstraction for sequential decision making. MTM takes a trajectory, such as a state-action sequence, and aims to reconstruct the trajectory conditioned on random subsets of the same trajectory. By training with a highly randomized masking pattern, MTM learns versatile networks that can take o...

RoboPianist: A Benchmark for High-Dimensional Robot Control

Preprint

Full-text available

Apr 2023

We introduce a new benchmarking suite for high-dimensional control, targeted at testing high spatial and temporal precision, coordination, and planning, all with an underactuated system frequently making-and-breaking contacts. The proposed challenge is mastering the piano through bi-manual dexterity, using a pair of simulated anthropomorphic robot...

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Preprint

Full-text available

Mar 2023

We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that non...

Foundation Models for Decision Making: Problems, Methods, and Opportunities

Preprint

Full-text available

Mar 2023

Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks. When such models are deployed in real world environments, they inevitably interface with other entities and agents. For example, language models are often used to interact with human beings through dialogu...

Preference Transformer: Modeling Human Preferences using Transformers for RL

Preprint

Mar 2023

Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neura...

Aligning Text-to-Image Models using Human Feedback

Preprint

Feb 2023

Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback, comprising three stages. First, we collect human feedback assessing model output...

Robust and Versatile Bipedal Jumping Control through Multi-Task Reinforcement Learning

Preprint

Feb 2023

This work aims to push the limits of agility for bipedal robots by enabling a torque-controlled bipedal robot to perform robust and versatile dynamic jumps in the real world. We present a multi-task reinforcement learning framework to train the robot to accomplish a large variety of jumping tasks, such as jumping to different locations and directio...

Controllability-Aware Unsupervised Skill Discovery

Preprint

Feb 2023

One of the key capabilities of intelligent agents is the ability to discover useful skills without external supervision. However, the current unsupervised skill discovery methods are often limited to acquiring simple, easy-to-learn skills due to the lack of incentives to discover more complex, challenging behaviors. We introduce a novel unsupervise...

The Wisdom of Hindsight Makes Language Models Better Instruction Followers

Preprint

Feb 2023

Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. The so-called algorithm, Reinforcement Learning with Human Feedback (RLHF), demonstrates impressive performance on the GPT series models. However, the underlying Reinforcement Learning (RL) algorithm is complex and r...

Multi-View Masked World Models for Visual Robotic Manipulation

Preprint

Feb 2023

Visual robotic manipulation research and applications often use multiple cameras, or views, to better perceive the world. How else can we utilize the richness of multi-view data? In this paper, we investigate how to learn good representations with multi-view data and utilize them for visual robotic manipulation. Specifically, we train a multi-view...

Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Preprint

Feb 2023

Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception - a crucial attribute needed to extend these models to be able to interact with the real world and...

Learning Universal Policies via Text-Guided Video Generation

Preprint

Full-text available

Jan 2023

A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images, exhibiting combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be...

Masked Autoencoding for Scalable and Generalizable Decision Making

Preprint

Nov 2022

We are interested in learning scalable agents for reinforcement learning that can learn from large-scale, diverse sequential data similar to current large vision and language models. To this end, this paper presents masked decision prediction (MaskDP), a simple and scalable self-supervised pretraining method for reinforcement learning (RL) and beha...

Multi-Environment Pretraining Enables Transfer to Action Limited Datasets

Preprint

Full-text available

Nov 2022

Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more avai...

VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models

Preprint

Full-text available

Nov 2022

Diffusion models have shown impressive results in text-to-image synthesis. Using massive datasets of captioned images, diffusion models learn to generate raster images of highly diverse objects and scenes. However, designers frequently use vector representations of images like Scalable Vector Graphics (SVGs) for digital icons or art. Vector graphic...

StereoPose: Category-Level 6D Transparent Object Pose Estimation from Stereo Images via Back-View NOCS

Preprint

Nov 2022

Most existing methods for category-level pose estimation rely on object point clouds. However, when considering transparent objects, depth cameras are usually not able to capture meaningful data, resulting in point clouds with severe artifacts. Without a high-quality point cloud, existing methods are not applicable to challenging transparent object...

Sim-to-Real via Sim-to-Seg: End-to-end Off-road Autonomous Driving Without Real Data

Preprint

Full-text available

Oct 2022

Autonomous driving is complex, requiring sophisticated 3D scene understanding, localization, mapping, and control. Rather than explicitly modelling and fusing each of these components, we instead consider an end-to-end approach via reinforcement learning (RL). However, collecting exploration driving data in the real world is impractical and dangero...

FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners

Preprint

Oct 2022

Large language models (LLMs) trained using the next-token-prediction objective, such as GPT3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs wi...

Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

Preprint

Oct 2022

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the...

Dichotomy of Control: Separating What You Can Control from What You Cannot

Preprint

Full-text available

Oct 2022

Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision tr...

Spending Thinking Time Wisely: Accelerating MCTS with Virtual Expansions

Preprint

Oct 2022

One of the most important AI research questions is to trade off computation versus performance since "perfect rationality" exists in theory but is impossible to achieve in practice. Recently, Monte-Carlo tree search (MCTS) has attracted considerable attention due to the significant performance improvement in various challenging domains. However, t...

Adversarial Motion Priors Make Good Substitutes for Complex Reward Functions

Conference Paper

Oct 2022

Learning Visual Robotic Control Efficiently with Contrastive Pre-training and Data Augmentation

Conference Paper

Oct 2022

Playful Interactions for Representation Learning

Conference Paper

Oct 2022

Multi-Objective Policy Gradients with Topological Constraints

Conference Paper

Oct 2022

CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Preprint

Oct 2022

Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the sampled tasks. This is a non-stationary process where the task di...

HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

Conference Paper

Oct 2022

Real-World Robot Learning with Masked Visual Pre-training

Preprint

Full-text available

Oct 2022

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are...

Temporally Consistent Video Transformer for Long-Term Video Prediction

Preprint

Full-text available

Oct 2022

Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods limit themselves to training on a small subset of frames that are then extended to generate longer videos through a sliding window fashion. Although these techniques may produce sharp videos, the...

Multi-Objective Policy Gradients with Topological Constraints

Preprint

Full-text available

Sep 2022

Multi-objective optimization models that encode ordered sequential constraints provide a solution to model various challenging problems including encoding preferences, modeling a curriculum, and enforcing measures of safety. A recently developed theory of topological Markov decision processes (TMDPs) captures this range of problems for the case of...

HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

Preprint

Sep 2022

Video prediction is an important yet challenging problem, burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool by separating the video prediction into two sub-problems: pre-training an image generator model, followed by...

Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks

Preprint

Sep 2022

In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods; however, none have shown success in sample-efficient learning through addressing estimatio...

AdaCat: Adaptive Categorical Discretization for Autoregressive Models

Preprint

Full-text available

Aug 2022

Autoregressive generative models can estimate complex continuous data distributions, like trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is...

DayDreamer: World Models for Physical Robot Learning

Preprint

Full-text available

Jun 2022

To solve tasks in complex environments, robots need to learn from experience. Deep reinforcement learning is a common approach to robot learning but requires a large amount of trial and error to learn, limiting its deployment in the physical world. As a consequence, many advances in robot learning rely on simulators. On the other hand, learning ins...

Masked World Models for Visual Control

Preprint

Jun 2022

Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet the current approaches typically train a single model end-to-end for learning both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In...

Frozen Pretrained Transformers as Universal Computation Engines

Article

Full-text available

Jun 2022

We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a...

Programmatic Modeling and Generation of Real-Time Strategic Soccer Environments for Reinforcement Learning

Article

Jun 2022

The capability of a reinforcement learning (RL) agent heavily depends on the diversity of the learning scenarios generated by the environment. Generation of diverse realistic scenarios is challenging for real-time strategy (RTS) environments. The RTS environments are characterized by intelligent entities/non-RL agents cooperating and competing with...

Patch-based Object-centric Transformers for Efficient Video Generation

Preprint

Full-text available

Jun 2022

In this work, we present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture that leverages object-centric information to efficiently model temporal dynamics in videos. We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed video...

Deep Hierarchical Planning from Pixels

Preprint

Jun 2022

Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement...

On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning

Preprint

Full-text available

Jun 2022

Intelligent agents should have the ability to leverage knowledge from previously learned tasks in order to learn new ones quickly and efficiently. Meta-learning approaches have emerged as a popular solution to achieve this. However, meta-reinforcement learning (meta-RL) algorithms have thus far been restricted to simple environments with narrow tas...

Multimodal Masked Autoencoders Learn Transferable Representations

Preprint

Full-text available

May 2022

Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used,...

Towards More Generalizable One-shot Visual Imitation Learning

Conference Paper

May 2022

Chain of Thought Imitation with Procedure Cloning

Preprint

Full-text available

May 2022

Imitation learning aims to extract high-performance policies from logged demonstrations of expert behavior. It is common to frame imitation learning as a supervised learning problem in which one fits a function approximator to the input-output mapping exhibited by the logged demonstrations (input observations to output actions). While the framing o...

An Empirical Investigation of Representation Learning for Imitation

Preprint

Full-text available

May 2022

Imitation learning often needs a large demonstration set in order to handle the full range of situations that an agent might find itself in during deployment. However, collecting expert demonstrations can be expensive. Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the n...

Coarse-to-fine Q-attention with Tree Expansion

Preprint

Full-text available

Apr 2022

Coarse-to-fine Q-attention enables sample-efficient robot manipulation by discretizing the translation space in a coarse-to-fine manner, where the resolution gradually increases at each layer in the hierarchy. Although effective, Q-attention suffers from "coarse ambiguity" - when voxelization is significantly coarse, it is not feasible to distingui...

Sim-to-Real 6D Object Pose Estimation via Iterative Self-training for Robotic Bin-picking

Preprint

Full-text available

Apr 2022

In this paper, we propose an iterative self-training framework for sim-to-real 6D object pose estimation to facilitate cost-effective robotic grasping. Given a bin-picking scenario, we establish a photo-realistic simulator to synthesize abundant virtual data, and use this to train an initial pose estimation network. This network then takes the role...

Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning

Preprint

Full-text available

Apr 2022

The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approaches infers the (unknown) reward function via inverse reinforcement learning (IRL) followed by maximizing this reward function via reinforcement learning (RL). The policies learned via these approaches...

Coarse-to-Fine Q-attention with Learned Path Ranking

Preprint

Full-text available

Apr 2022

We propose Learned Path Ranking (LPR), a method that accepts an end-effector goal pose, and learns to rank a set of goal-reaching paths generated from an array of path generating methods, including: path planning, Bezier curve sampling, and a learned policy. The core idea is that each of the path generation modules will be useful in different ta...

Pretraining Graph Neural Networks for few-shot Analog Circuit Modeling and Design

Preprint

Mar 2022

Being able to predict the performance of circuits without running expensive simulations is a desired capability that can catalyze automated design. In this paper, we present a supervised pretraining approach to learn circuit representations that can be adapted to new circuit topologies or unseen prediction tasks. We hypothesize that if we train a n...

Fig. 3. An agent trained with Adversarial Motion Priors extracts the...

Adversarial Motion Priors Make Good Substitutes for Complex Reward Functions

Preprint

Full-text available

Mar 2022

Training a high-dimensional simulated agent with an under-specified reward function often leads the agent to learn physically infeasible strategies that are ineffective when deployed in the real world. To mitigate these unnatural behaviors, reinforcement learning practitioners often utilize complex reward functions that encourage physically plausib...

Reinforcement Learning with Action-Free Pre-Training from Videos

Preprint

Mar 2022

Recent unsupervised pre-training methods have been shown to be effective in language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate if such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL). To this end, we introduce a framework that le...

SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning

Preprint

Mar 2022

Preference-based reinforcement learning (RL) has shown potential for teaching agents to perform target tasks without a costly, pre-defined reward function by learning the reward from a supervisor's preferences between two agent behaviors. However, preference-based learning often requires a large amount of human feedback, making it difficult...

Figure 3: Evaluation Domains. (Left) BabyAI (Middle) Point Maze...

Figure 7: All curves show the success rate of an advice-free policy...

Figure 17: Architecture diagram modified from BabyAI 1.1 [19]. For the...

Figure 18: Performance on the improvement phase in the Point Maze...

Teachable Reinforcement Learning via Advice Distillation

Preprint

Full-text available

Mar 2022

Training automated agents to complete complex tasks in interactive environments is challenging: reinforcement learning requires careful hand-engineering of reward functions, imitation learning requires specialized infrastructure and access to a human expert, and learning from intermediate forms of supervision (like binary preferences) is time-consu...

Figure 4: Illustration of the five environments, G id showing training...

Figure 8: Progressive goal generation plots for Point Mass Obstacle...

Figure 9: Full ablations of all components of CuSP from Table 1....

Figure 10: Goal plots after 2000 rounds for Toss. The left plot shows...

Figure 12: Comparison of CuSP with separately initialized and updated...

It Takes Four to Tango: Multiagent Selfplay for Automatic Curriculum Generation

Preprint

Full-text available

Feb 2022

We are interested in training general-purpose reinforcement learning agents that can solve a wide variety of goals. Training such agents efficiently requires automatic generation of a goal curriculum. This is challenging as it requires (a) exploring goals of increasing difficulty, while ensuring that the agent (b) is exposed to a diverse set of goa...

Fig. 2: Illustrative example of how the Bingham distribution changes as...

Fig. 3: Wahba environment results PPO and SAC, with and without the...

Fig. 5: RLBench environment (full state and shaped rewards) results for...

Fig. 6: RLBench environment with high-dimensional state (RGB & point...

Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning

Preprint

Full-text available

Feb 2022

We propose a new policy parameterization for representing 3D rotations during reinforcement learning. Today in the continuous control reinforcement learning literature, many stochastic policy parameterizations are Gaussian. We argue that universally applying a Gaussian policy parameterization is not always desirable for all environments. One such c...

Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning

Preprint

Jan 2022

Recent progress in deep learning has relied on access to large and diverse datasets. Such data-driven progress has been less evident in offline reinforcement learning (RL), because offline RL data is usually collected to optimize specific target tasks, limiting the data's diversity. In this work, we propose Exploratory data for Offline RL (ExORL), a...

Figure 1. This work deals with unsupervised skill discovery through...

Prior Competence-based Unsupervised Skill Discovery Algorithms

CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery

Preprint

Full-text available

Jan 2022

We introduce Contrastive Intrinsic Control (CIC), an algorithm for unsupervised skill discovery that maximizes the mutual information between skills and state transitions. In contrast to most prior approaches, CIC uses a decomposition of the mutual information that explicitly incentivizes diverse behaviors by maximizing state entropy. We derive a n...

Figure 1. Our method seeks out more diverse states from which to show...

Figure 5. Experimental setup for the counterfactual states user study....

Minigrid Policy Understanding and Evaluation

Explaining Reinforcement Learning Policies through Counterfactual Trajectories

Preprint

Full-text available

Jan 2022

In order for humans to confidently decide where to employ RL agents for real-world tasks, a human developer must validate that the agent will perform well at test-time. Some policy interpretability methods facilitate this by capturing the policy's decision making in a set of agent rollouts. However, even the most informative trajectories of trainin...

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Preprint

Jan 2022

Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge"). While prior work focused on learning from explicit step-...

Pretraining Graph Neural Networks for Few-Shot Analog Circuit Modeling and Design

Article

Jan 2022

Being able to predict the performance of circuits without running expensive simulations is a desired capability that can catalyze automated design. In this article, we present a supervised pretraining approach to learn circuit representations that can be adapted to new circuit topologies or unseen prediction tasks. We hypothesize that if we train a...

Target Entropy Annealing for Discrete Soft Actor-Critic

Preprint

Dec 2021

Soft Actor-Critic (SAC) is considered the state-of-the-art algorithm in continuous action space settings. It uses the maximum entropy framework for efficiency and stability, and applies a heuristic temperature Lagrange term to tune the temperature $\alpha$, which determines how "soft" the policy should be. It is counter-intuitive that empirical evi...

Figure 2. Example Dream Fields rendered from four perspectives. On the...

Figure 4. To encourage coherent foreground objects, Dream Fields train...

Figure 8. Training with diversely sampled camera poses improves...

Figure 9. Visualizing the total loss with different sparsity...

Zero-Shot Text-Guided Object Generation with Dream Fields

Preprint

Full-text available

Dec 2021

We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions. Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision. Due to the scarcity of diverse, captioned 3D data, prior methods only generate object...

Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL

Preprint

Dec 2021

Meta-reinforcement learning (meta-RL) has proven to be a successful framework for leveraging experience from prior tasks to rapidly learn new related tasks; however, current meta-RL approaches struggle to learn in sparse reward environments. Although existing meta-RL algorithms can learn strategies for adapting to new sparse reward tasks, the actua...

Figure 3: CBSQL results compared with DQN and fixed-temperature SQL,...

Hyper-parameters for tabular experiments.

Hyper-parameters for DQN, SQL and CBSQL on Atari 2600. The values of...

Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning

Preprint

Full-text available

Nov 2021

Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) and Soft Actor-Critic trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant tradeoff coefficient (temperature), contrary to the intuition that the temperatur...

Explaining robot policies

Article

Full-text available

Nov 2021

In order to interact with a robot or make wise decisions about where and how to deploy it in the real world, humans need to have an accurate mental model of how the robot acts in different situations. We propose to improve users’ mental model of a robot by showing them examples of how the robot behaves in informative scenarios. We explore this in t...

B-Pref: Benchmarking Preference-Based Reinforcement Learning

Preprint

Nov 2021

Reinforcement learning (RL) requires access to a reward function that incentivizes the right behavior, but these are notoriously hard to specify for complex tasks. Preference-based RL provides an alternative: learning policies using a teacher's preferences without pre-defined rewards, thus overcoming concerns associated with reward engineering. How...

Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning

Preprint

Nov 2021

Dexterous manipulation of arbitrary objects, a fundamental daily task for humans, has been a grand challenge for autonomous robotic systems. Although data-driven approaches using reinforcement learning can develop specialist policies that discover behaviors to control a single object, they often exhibit poor generalization to unseen ones. In this w...

Mastering Atari Games with Limited Data

Preprint

Oct 2021

Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key challenge, with prominent methods requiring millions (or even billions) of environment steps to train. Recently, there has been significant progress in sample efficient image-based RL algorithms; however, consistent human-level performan...

Figure 6: Individual results of fine-tuning efficiency as a function of...

Figure 7: Individual results of fine-tuning efficiency as a function of...

Figure 8: Finetuning curves for each evaluated unsupervised algorithm...

URLB: Unsupervised Reinforcement Learning Benchmark

Preprint

Full-text available

Oct 2021

Deep Reinforcement Learning (RL) has emerged as a powerful paradigm to solve a range of complex yet specific control tasks. Yet training generalist agents that can quickly adapt to new tasks remains an outstanding challenge. Recent advances in unsupervised RL have shown that pre-training RL agents with self-supervised intrinsic rewards can result i...

Figure 1: (a) Optimal state value in the grid world domain. Walls...

Figure 2: (a) True value $V^*(s)$ and optimal policy $\pi^*(s)$ of each...

Figure 3: Performance score achieved after 500k interactions, averaged...

Figure 4: First row: Average score over 5 runs. Second row: Initial...

Figure 6: First row: Average score of 1 life over 5 runs, experience is...

Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates

Preprint

Full-text available

Oct 2021

Temporal-Difference (TD) learning methods, such as Q-Learning, have proven effective at learning a policy to perform control tasks. One issue with methods like Q-Learning is that the value update introduces bias when predicting the TD target of an unfamiliar state. Estimation noise becomes a bias after the max operator in the policy improvement step...

Fig. 12: Illustration of self-attention blocks used in our model (left)...

Towards More Generalizable One-shot Visual Imitation Learning

Preprint

Full-text available

Oct 2021

A general-purpose robot should be able to master a wide range of tasks and quickly learn a novel one by leveraging past experiences. One-shot imitation learning (OSIL) approaches this goal by training an agent with (pairs of) expert demonstrations, such that at test time, it can directly execute a new task from just one demonstration. However, so f...

Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis

Conference Paper

Oct 2021

APS: Active Pretraining with Successor Features

Preprint

Aug 2021

We introduce a new unsupervised pretraining objective for reinforcement learning. During the unsupervised reward-free pretraining phase, the agent maximizes mutual information between tasks and states induced by the policy. Our key contribution is a novel lower bound of this intractable quantity. We show that by reinterpreting and combining variati...

Figure 4: SkiP and baselines (Sec. 4) evaluated over six tasks in the...

Figure 6: SkiP with preferences vs SkiP with learned sparse reward....

Figure 7: The plot compares SkiP with different segment size over the...

Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback

Preprint

Full-text available

Aug 2021

A promising approach to solving challenging long-horizon tasks has been to extract behavior priors (skills) by fitting generative models to large offline datasets of demonstrations. However, such generative models inherit the biases of the underlying data and result in poor and unusable skills when trained on imperfect demonstration data. To better...

AMP: adversarial motion priors for stylized physics-based character control

Article

Aug 2021

Synthesizing graceful and life-like behaviors for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviors. However, the effectiveness of these tracking-based methods...

Figure 4: The setup of our approach consists of a self-supervised...

Comparison of Amount of Expert Demonstration Data

Playful Interactions for Representation Learning

Preprint

Full-text available

Jul 2021

One of the key challenges in visual imitation learning is collecting large amounts of expert demonstrations for a given task. While methods for collecting human demonstrations are becoming easier with teleoperation methods and the use of low-cost assistive tools, we often still require 100-1000 demonstrations for every task to learn a visual repres...

Figure 2: Our algorithm, Few-Shot Imitation Learning with Skill...

Figure 3: Top: In each environment, we block some part of the...

Figure 4: Normalized Reward on all of our environments, and their...

Hierarchical Few-Shot Imitation with Skill Transition Models

Preprint

Full-text available

Jul 2021

A desirable property of autonomous agents is the ability to both solve long-horizon problems and generalize to unseen tasks. Recent advances in data-driven skill learning have shown that extracting behavioral priors from offline data can enable agents to solve challenging long-horizon tasks with reinforcement learning. However, generalization to ta...

The MineRL BASALT Competition on Learning from Human Feedback

Preprint

Full-text available

Jul 2021

The last decade has seen a significant increase of interest in deep learning research, with many public successes that have demonstrated its potential. As such, these systems are now being incorporated into commercial products. With this comes an additional challenge: how can we build AI systems that solve tasks where there is not a crisp, well-def...