Jamie Shotton's research while affiliated with Microsoft and other places

Publications (146)

Preprint
An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent...
Preprint
Full-text available
We demonstrate that it is possible to perform face-related computer vision in the wild using synthetic data alone. The community has long enjoyed the benefits of synthesizing training data with graphics, but the domain gap between real and synthetic data has remained a problem, especially for human faces. Researchers have tried to bridge this gap w...
Preprint
Recent work on Neural Radiance Fields (NeRF) showed how neural networks can be used to encode complex 3D environments that can be rendered photorealistically from novel viewpoints. Rendering these images is very computationally demanding and recent improvements are still a long way from enabling interactive rates, even on high-end hardware. Motivat...
Chapter
Our ability to sample realistic natural images, particularly faces, has advanced by leaps and bounds in recent years, yet our ability to exert fine-tuned control over the generative process has lagged behind. If this new technology is to find practical uses, we need to achieve a level of control over generative networks which, without sacrificing r...
Chapter
Realtime perceptual and interaction capabilities in mixed reality require a range of 3D tracking problems to be solved at low latency on resource-constrained hardware such as head-mounted devices. Indeed, for devices such as HoloLens 2 where the CPU and GPU are left available for applications, multiple tracking subsystems are required to run on a c...
Chapter
Generating photorealistic images of human faces at scale remains a prohibitively difficult task using computer graphics approaches. This is because these require the simulation of light to be photorealistic, which in turn requires physically accurate modelling of geometry, materials, and light sources, for both the head and the surrounding scene. N...
Preprint
Analysis of faces is one of the core applications of computer vision, with tasks including landmark alignment, head pose estimation, expression recognition, and face recognition, among others. However, building reliable methods requires time-consuming data collection and often even more time-consuming manual annotation, which can be unreliable. I...
Preprint
Realtime perceptual and interaction capabilities in mixed reality require a range of 3D tracking problems to be solved at low latency on resource-constrained hardware such as head-mounted devices. Indeed, for devices such as HoloLens 2 where the CPU and GPU are left available for applications, multiple tracking subsystems are required to run on a c...
Preprint
Generating photorealistic images of human faces at scale remains a prohibitively difficult task using computer graphics approaches. This is because these require the simulation of light to be photorealistic, which in turn requires physically accurate modelling of geometry, materials, and light sources, for both the head and the surrounding scene. N...
Preprint
Our ability to sample realistic natural images, particularly faces, has advanced by leaps and bounds in recent years, yet our ability to exert fine-tuned control over the generative process has lagged behind. If this new technology is to find practical uses, we need to achieve a level of control over generative networks which, without sacrificing r...
Article
Full-text available
Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of system...
Article
Hand pose estimation, formulated as an inverse problem, is typically optimized by an energy function over pose parameters using a ‘black box’ image generation procedure, knowing little about either the relationships between the parameters or the form of the energy function. In this paper, we show significant improvement upon the black box optimizat...
Article
State-of-the-art computer vision algorithms often achieve efficiency by making discrete choices about which hypotheses to explore next. This allows allocation of computational resources to promising candidates; however, such decisions are non-differentiable. As a result, these algorithms are hard to train in an end-to-end fashion. In this work we p...
Article
Full-text available
RANSAC is an important algorithm in robust optimization and a central building block for many computer vision applications. In recent years, traditionally hand-crafted pipelines have been replaced by deep learning pipelines, which can be trained in an end-to-end fashion. However, RANSAC has so far not been used as part of such deep learning pipelin...
Conference Paper
This paper presents ShadowHands - a novel technique for visualizing a remote user's hand gestures using a single depth sensor and hand tracking system. Previous work has shown that making distributed users better aware of each other's gestures facilitates remote collaboration. These systems presented virtual embodiments as a stream of raw 2D or 3D...
Article
Full-text available
The sixteen papers in this special section focus on human pose recovery and behavior analysis (HuPBA). This is one of the most challenging topics in computer vision, pattern analysis, and machine learning. It is of critical importance for application areas that include gaming, computer interaction, human robot interaction, security, commerce, assis...
Article
Fully articulated hand tracking promises to enable fundamentally new interactions with virtual and augmented worlds, but the limited accuracy and efficiency of current systems has prevented widespread adoption. Today's dominant paradigm uses machine learning for initialization and recovery followed by iterative model-fitting optimization to achieve...
Conference Paper
Full-text available
Published as a conference paper at ICLR 2016. Trained models at http://dx.doi.org/10.5281/zenodo.53189
Article
This paper investigates the connections between two state-of-the-art classifiers: decision forests (DFs, including decision jungles) and convolutional neural networks (CNNs). Decision forests are computationally efficient thanks to their conditional computation property (computation is confined to only a small region of the tree, the nodes along a...
Article
We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to comb...
Article
Full-text available
We present a new interactive and online approach to 3D scene understanding. Our system, SemanticPaint, allows users to simultaneously scan their environment, whilst interactively segmenting the scene simply by reaching out and touching any desired object or surface. Our system continuously learns from these segmentations, and labels new unseen part...
Conference Paper
Full-text available
Man-made objects, such as chairs, often have very large shape variations, making it challenging to detect them. In this work we investigate the task of finding particular object shapes from a single depth image. We tackle this task by exploiting the inherently low dimensionality in the object shape variations, which we discover and encode as a comp...
Article
Full-text available
We present a new method for inferring dense data to model correspondences, focusing on the application of human pose estimation from depth images. Recent work proposed the use of regression forests to quickly predict correspondences between depth pixels and points on a 3D human mesh model. That work, however, used a proxy forest training objective...
Article
Recovery from tracking failure is essential in any simultaneous localization and tracking system. In this context, we explore an efficient keyframe-based relocalization method based on frame encoding using randomized ferns. The method enables automatic discovery of keyframes through online harvesting in tracking mode, and fast retrieval of pose can...
Article
Full-text available
Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of system...
Conference Paper
Full-text available
We present a new real-time hand tracking system based on a single depth camera. The system can accurately reconstruct complex hand poses across a variety of subjects. It also allows for robust tracking, rapidly recovering from any temporary failures. Most uniquely, our tracker is highly flexible, dramatically improving upon previous approaches whic...
Technical Report
Full-text available
This paper investigates the connections between two state-of-the-art classifiers: decision forests (DFs, including decision jungles) and convolutional neural networks (CNNs). Decision forests are computationally efficient thanks to their conditional computation property (computation is confined to only a small region of the tree, the nodes along a...
Conference Paper
Full-text available
This paper summarizes the ChaLearn Looking at People 2014 challenge data and the results obtained by the participants. The competition was split into three independent tracks: human pose recovery from RGB data, action and interaction recognition from RGB data sequences, and multi-modal gesture recognition from RGB-Depth sequences. For all the track...
Patent
Learning image processing tasks from scene reconstructions is described where the tasks may include but are not limited to: image de-noising, image in-painting, optical flow detection, interest point detection. In various embodiments training data is generated from a 2 or higher dimensional reconstruction of a scene and from empirical images of the...
Patent
Full-text available
Density estimation and/or manifold learning are described, for example, for computer vision, medical image analysis, text document clustering. In various embodiments a density forest is trained using unlabeled data to estimate the data distribution. In embodiments the density forest comprises a plurality of random decision trees each accumulating p...
Patent
Full-text available
An enhanced training sample set containing new synthesized training images that are artificially generated from an original training sample set is provided to satisfactorily increase the accuracy of an object recognition system. The original sample set is artificially augmented by introducing one or more variations to the original images with littl...
Article
Full-text available
In this paper we report the set-up and results of the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) organized in conjunction with the MICCAI 2012 and 2013 conferences. Twenty state-of-the-art tumor segmentation algorithms were applied to a set of 65 multi-contrast MR scans of low- and high-grade glioma patients – manually annotated by...
Patent
Full-text available
Using high-level attributes to guide image processing is described. In an embodiment high-level attributes of images of people such as height, torso orientation, body shape, gender are used to guide processing of the images for various tasks including but not limited to joint position detection, body part classification, medical image analysis and...
Patent
Automatic organ localization is described. In an example, an organ in a medical image is localized using one or more trained regression trees. Each image element of the medical image is applied to the trained regression trees to compute probability distributions that relate to a distance from each image element to the organ. At least a subset of th...
Patent
Full-text available
Image labeling with global parameters is described. In an embodiment a pose estimation system executes automatic body part labeling. For example, the system may compute joint recognition or body part segmentation for a gaming application. In another example, the system may compute organ labels for a medical imaging application. In an example, at le...
Conference Paper
This work addresses the problem of estimating the 6D Pose of specific objects from a single RGB-D image. We present a flexible approach that can deal with generic objects, both textured and texture-less. The key new concept is a learned, intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labellin...
Article
Full-text available
We present a machine learning technique for estimating absolute, per-pixel depth using any conventional monocular 2D camera, with minor hardware modifications. Our approach targets close-range human capture and interaction where dense 3D estimation of hands and faces is desired. We use hybrid classification-regression forests to learn how to map fr...
Conference Paper
Full-text available
We propose 'filter forests' (FF), an efficient new discriminative approach for predicting continuous variables given a signal and its context. FF can be used for general signal restoration tasks that can be tackled via convolutional filtering, where it attempts to learn the optimal filtering kernels to be applied to each data point. The model can...
Conference Paper
This paper presents a method for acquiring dense nonrigid shape and deformation from a single monocular depth sensor. We focus on modeling the human hand, and assume that a single rough template model is available. We combine and extend existing work on model-based tracking, subdivision surface fitting, and mesh deformation to acquire detailed hand...
Conference Paper
We address the problem of estimating the pose of a camera relative to a known 3D scene from a single RGB-D frame. We formulate this problem as inversion of the generative rendering procedure, i.e., we want to find the camera pose corresponding to a rendering of the 3D scene model that is most similar to the observed input. This is a non-convex...
Patent
Computing pose and/or shape of a modifiable entity is described. In various embodiments a model of an entity (such as a human hand, a golf player holding a golf club, an animal, a body organ) is fitted to an image depicting an example of the entity in a particular pose and shape. In examples, an optimization process finds values of pose and/or shap...
Conference Paper
Full-text available
We present RetroDepth, a new vision-based system for accurately sensing the 3D silhouettes of hands, styluses, and other objects, as they interact on and above physical surfaces. Our setup is simple, cheap, and easily reproducible, comprising two infrared cameras, diffuse infrared LEDs, and any off-the-shelf retro-reflective material. The retro-...
Patent
Full-text available
A system and method for detecting and tracking targets including body parts and props is described. In one aspect, the disclosed technology acquires one or more depth images, generates one or more classification maps associated with one or more body parts and one or more props, tracks the one or more body parts using a skeletal tracking system, tra...
Patent
Techniques for human body pose estimation are disclosed herein. Images such as depth images, silhouette images, or volumetric images may be generated and pixels or voxels of the images may be identified. The techniques may process the pixels or voxels to determine a probability that each pixel or voxel is associated with a segment of a body capture...
Patent
Foreground and background image segmentation is described. In an example, a seed region is selected in a foreground portion of an image, and a geodesic distance is calculated from each image element to the seed region. A subset of the image elements having a geodesic distance less than a threshold is determined, and this subset of image elements ar...
Patent
Full-text available
A method of tracking a target includes receiving from a source a depth image of a scene including the human subject. The depth image includes a depth for each of a plurality of pixels. The method further includes identifying pixels of the depth image that belong to the human subject and deriving from the identified pixels of the depth image one or...
Patent
Systems and methods are disclosed for identifying objects captured by a depth camera by condensing classified image data into centroids of probability that captured objects are correctly identified entities. Output exemplars are processed to detect spatially localized clusters of non-zero probability pixels. For each cluster, a centroid is generate...
Article
Full-text available
We describe two new approaches to human pose estimation. Both can quickly and accurately predict the 3D positions of body joints from a single depth image without using any temporal information. The key to both approaches is the use of a large, realistic, and highly varied synthetic set of training images. This allows us to learn models that are la...
Patent
Three-dimensional environment reconstruction is described. In an example, a 3D model of a real-world environment is generated in a 3D volume made up of voxels stored on a memory device. The model is built from data describing a camera location and orientation, and a depth image with pixels indicating a distance from the camera to a point in the env...
Patent
Use of a 3D environment model in gameplay is described. In an embodiment, a mobile depth camera is used to capture a series of depth images as it is moved around and a dense 3D model of the environment is generated from this series of depth images. This dense 3D model is incorporated within an interactive application, such as a game. The mobile dep...
Patent
Predicting joint positions is described, for example, to find joint positions of humans or animals (or parts thereof) in an image to control a computer game or for other applications. In an embodiment image elements of a depth image make joint position votes so that for example, an image element depicting part of a torso may vote for a position of...
Conference Paper
We introduce an efficient camera relocalization approach which can be easily integrated into real-time 3D reconstruction methods, such as KinectFusion. Our approach makes use of compact encoding of whole image frames which enables both online harvesting of keyframes in tracking mode, and fast retrieval of pose proposals when tracking is lost. The e...
Patent
A computerized decision tree training system may include a distributed control processing unit configured to receive input of training data for training a decision tree. The system may further include a plurality of data batch processing units, each data batch processing unit being configured to evaluate each of a plurality of split functions of a...
Conference Paper
Full-text available
We introduce new approaches for augmenting annotated training datasets used for object detection tasks that serve two goals: reducing the effort needed for collecting and manually annotating huge datasets, and introducing novel variations to the initial dataset that help the learning algorithms. The methods presented in this work aim at reloca...
Article
With the invention of the low-cost Microsoft Kinect sensor, high-resolution depth and visual (RGB) sensing has become available for widespread use as an off-the-shelf technology. This special issue is specifically dedicated to new algorithms and/or new applications based on the Kinect (or similar RGB-D) sensors. In total, we received over ninety submissions from more than twenty countries all...
Patent
Full-text available
Systems and methods for estimating a posture of a body part of a user are disclosed. In one disclosed embodiment, an image is received from a sensor, where the image includes at least a portion of an image of the user including the body part. The skeleton information of the user is estimated from the image, a region of the image corresponding to th...
Article
With the invention of the low-cost Microsoft Kinect sensor, high-resolution depth and visual (RGB) sensing has become available for widespread use. The complementary nature of the depth and visual information provided by the Kinect sensor opens up new opportunities to solve fundamental problems in computer vision. This paper presents a comprehensiv...
Conference Paper
We address the problem of inferring the pose of an RGB-D camera relative to a known 3D scene, given only a single acquired image. Our approach employs a regression forest that is capable of inferring an estimate of each pixel's correspondence to 3D points in the scene's world coordinate frame. The forest uses only simple depth and RGB pixel compari...
Conference Paper
Conventional decision forest based methods for image labelling tasks like object segmentation make predictions for each variable (pixel) independently [3, 5, 8]. This prevents them from enforcing dependencies between variables and translates into locally inconsistent pixel labellings. Random field models, instead, encourage spatial consistency of l...
Conference Paper
Image segmentation is the process of partitioning an image into segments or subsets of pixels for purposes of further analysis, such as separating the interesting objects in the foreground from the uninteresting objects in the background. In many image processing applications, the process requires a sequence of computational steps on a per pixel b...
Patent
Moving object segmentation using depth images is described. In an example, a moving object is segmented from the background of a depth image of a scene received from a mobile depth camera. A previous depth image of the scene is retrieved, and compared to the current depth image using an iterative closest point algorithm. The iterative closest point...
Article
This paper proposes a new algorithm for the efficient, automatic detection and localization of multiple anatomical structures within three-dimensional computed tomography (CT) scans. Applications include selective retrieval of patient images from PACS systems, semantic visual navigation and tracking radiation dose over time. The main contribution...
Chapter
This chapter discusses the use of regression forests for the automatic detection and simultaneous localization of multiple anatomical regions within computed tomography (CT) and magnetic resonance (MR) three-dimensional images. Important applications include: organ-specific tracking of radiation dose over time; selective retrieval of patient images...
Chapter
This chapter describes a variety of techniques for writing efficient, scalable, and general-purpose decision forest software. It will cover: algorithmic considerations, such as how to train in depth first or breadth first order; optimizations, such as cheaply evaluating multiple thresholds for a given feature; designing for multi-core, GP...
Chapter
Problems related to the automatic or semi-automatic analysis of complex data such as photographs, videos, medical scans, text or genomic data can all be categorized into a relatively small set of prototypical machine learning tasks. The popularity of decision forests is mostly due to their recent success in classification tasks. However, forests ar...