Dieter Fox
University of Washington Seattle | UW · Department of Computer Science and Engineering

300 Publications
123,240 Reads
50,614 Citations

Publications (300)
Article
3D scene understanding is important for robots to interact with the 3D world in a meaningful way. Most previous works on 3D scene understanding focus on recognizing geometrical or semantic properties of the scene independently. In this work, we introduce Data Associated Recurrent Neural Networks (DA-RNNs), a novel framework for joint 3D scene mappi...
Article
Liquids are an important part of many common manipulation tasks in human environments. If we wish to have robots that can accomplish these types of tasks, they must be able to interact with liquids in an intelligent manner. In this paper, we investigate ways for robots to perceive and reason about liquids. That is, the robots ask the questions What...
Chapter
RGB-D cameras provide both a color image and per-pixel depth estimates. The richness of their data and the recent development of low-cost sensors have combined to present an attractive opportunity for mobile robotics research. In this paper, we describe a system for visual odometry and mapping using an RGB-D camera, and its application to autonomou...
Conference Paper
Inverse Reinforcement Learning (IRL) has been studied for more than 15 years and is of fundamental importance in robotics. It allows learning a utility function "explaining" the behavior of an agent, and can thus be used for imitation or prediction of a given behavior while having access only to demonstrated optimal or near-optimal solutions. In t...
Article
Robust estimation of correspondences between image pixels is an important problem in robotics, with applications in tracking, mapping, and recognition of objects, environments, and other agents. Correspondence estimation has long been the domain of hand-engineered features, but more recently deep learning techniques have provided powerful tools for...
Poster
In recent years, a lot of interest has been focused on using Deep Convolutional Neural Networks (ConvNets) to encode cost functions or directly control policies. Using such powerful non-linear function approximators allows learning directly from low-level features, thus not requiring domain knowledge, which can potentially lead to learn higher...
Article
We introduce SE3-Nets, which are deep networks designed to model rigid body motion from raw point cloud data. Based only on pairs of depth images along with an action vector and point-wise data associations, SE3-Nets learn to segment affected object parts and predict their motion resulting from the applied force. Rather than learning point-wise flo...
Article
A fundamental challenge in robotics today is building robots that can learn new skills by observing humans and imitating human actions. We propose a new Bayesian approach to robotic learning by imitation inspired by the developmental hypothesis that children use self-experience to bootstrap the process of intention recognition and goal-based imitat...
Conference Paper
Advances in mobile robotics have enabled robots that can autonomously operate in human-populated environments. Although primary tasks for such robots might be fetching, delivery, or escorting, they present an untapped potential as information gathering agents that can answer questions for the community of co-inhabitants. In this paper, we seek to bette...
Article
This paper introduces DART, a general framework for tracking articulated objects composed of rigid bodies connected through a kinematic tree. DART covers a broad set of objects encountered in indoor environments, including furniture and tools, and human and robot bodies, hands and manipulators. To achieve efficient and robust tracking, DART extends...
Conference Paper
This work integrates visual and physical constraints to perform real-time depth-only tracking of articulated models, with a focus on tracking a robot's manipulators and the objects they interact with in realistic scenarios. As such, we modify an existing dense visual articulated object tracker to additionally avoid interpenetration of multiple inte...
Article
Hierarchies of concepts are useful in many applications from navigation to organization of objects. Usually, a hierarchy is created in a centralized manner by employing a group of domain experts, a time-consuming and expensive process. The experts often design one single hierarchy to best explain the semantic relationships among the concepts, and i...
Conference Paper
Autonomous mobile robots equipped with a number of sensors will soon be ubiquitous in human populated environments. In this paper we present an initial exploration into the potential of using such robots for information gathering. We present findings from a formative user survey and a 4-day long Wizard-of-Oz deployment of a robot that answers quest...
Article
Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning reduces the amount of engineering knowledge that would otherwise be required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is...
Article
This issue of Autonomous Robots presents journal articles that are based on papers originally presented at the 2013 Robotics: Science and Systems conference, held in Berlin, Germany. Although these were selected by a committee to exemplify the best papers presented that year, the decision over which papers to include was a difficult one due to the s...
Article
Functional gradient algorithms (e.g. CHOMP) have recently shown great promise for producing locally optimal motion for complex many degree-of-freedom robots. A key limitation of such algorithms is the difficulty in incorporating constraints and cost functions that explicitly depend on time. We present T-CHOMP, a functional gradient algorithm that o...
Article
In this paper, we attack the problem of learning a predictive model of a depth camera and manipulator directly from raw execution traces. While the problem of learning manipulator models from visual and proprioceptive data has been addressed before, existing techniques often rely on assumptions about the structure of the robot or tracked features i...
Article
As robots become more ubiquitous, it is increasingly important for untrained users to be able to interact with them intuitively. In this work, we investigate how people refer to objects in the world during relatively unstructured communication with robots. We collect a corpus of deictic interactions from users describing objects, which we use to tr...
Conference Paper
Learning policies that generalize across multiple tasks is an important and challenging research topic in reinforcement learning and robotics. Training individual policies for every single potential task is often impractical, especially for continuous task variations, requiring more principled approaches to share and transfer knowledge among simila...
Conference Paper
Tactile sensing plays an important role in robot grasping and object recognition. In this work, we propose a new descriptor named Spatio-Temporal Hierarchical Matching Pursuit (ST-HMP) that captures properties of a time series of tactile sensor measurements. It is based on the concept of unsupervised hierarchical feature learning realized using spa...
Conference Paper
This paper presents an approach for labeling objects in 3D scenes. We introduce HMP3D, a hierarchical sparse coding technique for learning features from 3D point cloud data. HMP3D classifiers are trained using a synthetic dataset of virtual scenes generated using CAD models from an online database. Our scene labeling system combines features learne...
Article
Recently introduced RGB-D cameras are capable of providing high quality synchronized videos of both color and depth. With its advanced sensing capabilities, this technology represents an opportunity to significantly increase the capabilities of object recognition. It also raises the problem of developing expressive features for the color and depth...
Article
RGB-D cameras, such as Microsoft Kinect, are active sensors that provide high-resolution dense color and depth information at real-time frame rates. The wide availability of affordable RGB-D cameras is causing a revolution in perception and changing the landscape of robotics and related fields. RGB-D perception has been the focus of a great deal of...
Article
Learning policies that generalize across multiple tasks is an important and challenging research topic in reinforcement learning and robotics. Training individual policies for every single potential task is often impracticable, especially for continuous task variations, requiring more principled approaches to share and transfer knowledge among simi...
Conference Paper
Complex real-world signals, such as images, contain discriminative structures that differ in many aspects including scale, invariance, and data channel. While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. We propose Multipath Hierarchica...
Conference Paper
Recent advances have allowed for the creation of dense, accurate 3D maps of indoor environments using RGB-D cameras. Some techniques are able to create large-scale maps, while others focus on accurate details using GPU-accelerated volumetric representations. In this work we describe patch volumes, a novel multiple-volume representation which enable...
Conference Paper
3-D motion estimation is a fundamental problem that has far-reaching implications in robotics. A scene flow formulation is attractive as it makes no assumptions about scene complexity, object rigidity, or camera motion. RGB-D cameras provide new information useful for computing dense 3-D flow in challenging scenes. In this work we show how to gener...
Conference Paper
Over the last years, the robotics community has made substantial progress in detection and 3D pose estimation of known and unknown objects. However, the question of how to identify objects based on language descriptions has not been investigated in detail. While the computer vision community recently started to investigate the use of attributes for...
Article
To understand how versatile dexterity is achieved in the human hand and to achieve it in a robotic form, we have constructed an anatomically correct testbed (ACT) hand. This paper focuses on the development of control strategies for the index finger motion and implementation of joint passive behavior in the ACT hand. A direct muscle position contro...
Article
As robots become more ubiquitous and capable of performing complex tasks, the importance of enabling untrained users to interact with them has increased. In response, unconstrained natural-language interaction with robots has emerged as a significant research area. We discuss the problem of parsing natural language commands to actions and control s...
Article
Recently introduced RGB-D cameras are capable of providing high quality synchronized videos of both color and depth. With its advanced sensing capabilities, this technology represents an opportunity to dramatically increase the capabilities of object recognition. It also raises the problem of developing expressive features for the color and dept...
Chapter
Over the last decade, the availability of public image repositories and recognition benchmarks has enabled rapid progress in visual object category and instance detection. Today we are witnessing the birth of a new generation of sensing technologies capable of providing high quality synchronized videos of both color and depth, the RGB-D (Kinect-sty...
Article
As robotic technologies mature, we can imagine an increasing number of applications in which robots could soon prove to be useful in unstructured human environments. Many of those applications require a natural interface between the robot and untrained human users or are possible only in a human-robot collaborative scenario. In this paper, we study...
Article
We present an application of hierarchical Bayesian estimation to robot map building. The revisiting problem occurs when a robot has to decide whether it is seeing a previously-built portion of a map, or is exploring new territory. This is a difficult decision problem, requiring the probability of being outside of the current known map. To estimate...
Conference Paper
We demonstrate a real-time system which infers and tracks the assembly process of a snap-together block model using a Kinect® sensor. The inference enables us to build a virtual replica of the model at every step. Tracking enables us to provide context-specific visual feedback on a screen by augmenting the rendered virtual model aligned with the phy...
Conference Paper
We present a first study of using RGB-D (Kinect-style) cameras for fine-grained recognition of kitchen activities. Our prototype system combines depth (shape) and color (appearance) to solve a number of perception problems crucial for smart space applications: locating hands, identifying objects and their functionalities, recognizing actions and tr...
Article
RGB-D cameras provide both color images and per-pixel depth estimates. The richness of this data and the recent development of low-cost sensors have combined to present an attractive opportunity for mobile robotics research. In this paper, we describe a system for visual odometry and mapping using an RGB-D camera, and its application to autonomous...
Article
We introduce a new dynamic model with the capability of recognizing both activities that an individual is performing as well as where that individual is located. Our model is novel in that it utilizes a dynamic graphical model to jointly estimate both activity and spatial context over time based on the simultaneous use of asynchronous observations c...
Article
As robots become more ubiquitous and capable, it becomes ever more important to enable untrained users to easily interact with them. Recently, this has led to the study of the language grounding problem, where the goal is to extract representations of the meanings of natural language tied to perception and actuation in the physical world. In this paper...
Conference Paper
Scene labeling research has mostly focused on outdoor scenes, leaving the harder case of indoor scenes poorly understood. Microsoft Kinect dramatically changed the landscape, showing great potential for RGB-D perception (color+depth). Our main objective is to empirically understand the promises and challenges of scene labeling with RGB-D. We use t...
Article
We propose a view-based approach for labeling objects in 3D scenes reconstructed from RGB-D (color+depth) videos. We utilize sliding window detectors trained from object views to assign class probabilities to pixels in every RGB-D frame. These probabilities are projected into the reconstructed 3D scene and integrated using a voxel representation. W...
Article
Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and c...
Article
While Iterative Closest Point (ICP) algorithms have been successful at aligning 3D point clouds, they do not take into account constraints arising from sensor viewpoints. More recent beam-based models take into account sensor noise and viewpoint, but problems still remain. In particular, good optimization strategies are still lacking for the beam-b...
Article
Interaction with unstructured groups of objects allows a robot to discover and manipulate novel items in cluttered environments. We present a framework for interactive singulation of individual items from a pile. The proposed framework provides an overall approach for tasks involving operation on multiple objects, such as counting, arranging, or so...
Article
RGB-D cameras (such as the Microsoft Kinect) are novel sensing systems that capture RGB images along with per-pixel depth information. In this paper we investigate how such cameras can be used for building dense 3D maps of indoor environments. Such maps have applications in robot navigation, manipulation, semantic mapping, and telepresence. We pres...
Article
We present an overview of the data collection and transcription efforts for the COnversational Speech In Noisy Environments (COSINE) corpus. The corpus is a set of multi-party conversations recorded in real world environments, with background noise, that can be used to train noise-robust speech recognition systems or develop speech de-noising algorit...
Article
To enable fast deployment and long-term operation, service robots must have the capability to discover and learn about novel objects even in previously unseen parts of the environment. Since most mobile robots are interested in objects for their potential for manipulation, one good way for a robot to find objects is to attempt to manipulate the sc...
Article
Recognizing and manipulating objects is an important task for mobile robots performing useful services in everyday environments. While existing techniques for object recognition related to manipulation provide very good results even for noisy and incomplete data, they are typically trained using data generated in an offline process. As a result, th...
Article
One of the ultimate goals of the field of artificial intelligence and robotics is to develop systems that assist us in our everyday lives by autonomously carrying out a variety of different tasks. To achieve this and to generate appropriate actions, such systems need to be able to accurately interpret their sensory input and estimate their state or...
Conference Paper
Detailed 3D visual models of indoor spaces, from walls and floors to objects and their configurations, can provide extensive knowledge about the environments as well as rich contextual information of people living therein. Vision-based 3D modeling has only seen limited success in applications, as it faces many technical challenges that only a few e...
Conference Paper
Consumer depth cameras, such as the Microsoft Kinect, are capable of providing frames of dense depth values in real time. One fundamental question in utilizing depth cameras is how to best extract features from depth frames. Motivated by local descriptors on images, in particular kernel descriptors, we develop a set of kernel features on depth imag...
Conference Paper
We introduce an algorithm for object discovery from RGB-D (color plus depth) data, building on recent progress in using RGB-D cameras for 3-D reconstruction. A set of 3-D maps are built from multiple visits to the same scene. We introduce a multi-scene MRF model to detect objects that moved between visits, combining shape, visibility, and color cue...
Article
Recognizing possibly thousands of objects is a crucial capability for an autonomous agent to understand and interact with everyday environments. Practical object recognition comes in multiple forms: Is this a coffee mug? (category recognition). Is this Alice's coffee mug? (instance recognition). Is the mug with the handle facing left or right? (pose...
Conference Paper
Kernel descriptors provide a unified way to generate rich visual feature sets by turning pixel attributes into patch-level features, and yield impressive results on many object recognition tasks. However, best results with kernel descriptors are achieved using efficient match kernels in conjunction with nonlinear SVMs, which makes it impractical fo...
Conference Paper
In this work we address joint object category and instance recognition in the context of RGB-D (depth) cameras. Motivated by local distance learning, where a novel view of an object is compared to individual views of previously seen objects, we define a view-to-object distance where a novel view is compared simultaneously to all views of a previous...
Conference Paper
Over the last decade, the availability of public image repositories and recognition benchmarks has enabled rapid progress in visual object category and instance detection. Today we are witnessing the birth of a new generation of sensing technologies capable of providing high quality synchronized videos of both color and depth, the RGB-D (Kinect-sty...
Conference Paper
We present HeatWave, a system that uses digital thermal imaging cameras to detect, track, and support user interaction on arbitrary surfaces. Thermal sensing has had limited examination in the HCI research community and is generally under-explored outside of law enforcement and energy auditing applications. We examine the role of thermal imaging as...