Steven M. Seitz's research while affiliated with University of Washington Seattle and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (192)
We present ClearBuds, the first hardware and software system that utilizes a neural network to enhance speech streamed from two wireless earbuds. Real-time speech enhancement for wireless earbuds requires high-quality sound separation and background cancellation, operating in real-time and on a mobile phone. ClearBuds bridges state-of-the-art deep...
Many historical people were captured only in old, faded, black and white photos that are distorted due to the limitations of early cameras and the passage of time. This paper simulates traveling back in time with a modern camera to rephotograph famous subjects. Unlike conventional image restoration filters which apply independent operations l...
We present a real-time bidirectional communication system that lets two people, separated by distance, experience a face-to-face conversation as if they were copresent. It is the first telepresence system that is demonstrably better than 2D videoconferencing, as measured using participant ratings (e.g., presence, attentiveness, reaction-gauging, en...
Neural Radiance Fields (NeRF) are able to reconstruct scenes with unprecedented fidelity, and various recent works have extended NeRF to handle dynamic scenes. A common approach to reconstruct such non-rigid scenes is through the use of a learned deformation field mapping from coordinates in each input image into a canonical template coordinate spa...
Nonprehensile manipulation involves long horizon underactuated object interactions and physical contact with different objects that can inherently introduce a high degree of uncertainty. In this work, we introduce a novel Real-to-Sim reward analysis technique, called Riemannian Motion Predictive Control (RMPC), to reliably imagine and predict the o...
We present a framework for automatically reconfiguring images of street scenes by populating, depopulating, or repopulating them with objects such as pedestrians or vehicles. Applications of this method include anonymizing images to enhance privacy, generating data augmentations for perception tasks like autonomous driving, and composing scenes to...
Many historical people are captured only in old, faded, black and white photos that have been distorted by the limitations of early cameras and the passage of time. This paper simulates traveling back in time with a modern camera to rephotograph famous subjects. Unlike conventional image restoration filters which apply independent operations like...
In this paper, we demonstrate a fully automatic method for converting a still image into a realistic animated looping video. We target scenes with continuous fluid motion, such as flowing water and billowing smoke. Our method relies on the observation that this type of natural motion can be convincingly reproduced from a static Eulerian motion desc...
We present the first method capable of photorealistically reconstructing a non-rigidly deforming scene using photos/videos captured casually from mobile phones. Our approach -- D-NeRF -- augments neural radiance fields (NeRF) by optimizing an additional continuous volumetric deformation field that warps each observed point into a canonical 5D NeRF....
By analyzing the motion of people and other objects in a scene, we demonstrate how to infer depth, occlusion, lighting, and shadow information from video taken from a single camera viewpoint. This information is then used to composite new objects into the same scene with a high degree of automation and realism. In particular, when a user places a n...
Great progress has been made in 3D body pose and shape estimation from a single photo. Yet, state-of-the-art results still suffer from errors due to challenging body poses, modeling clothing, and self occlusions. The domain of basketball games is particularly challenging, as it exhibits all of these challenges. In this paper, we introduce a new app...
By moving a depth sensor around a room, we compute a 3D CAD model of the environment, capturing the room shape and contents such as chairs, desks, sofas, and tables. Rather than reconstructing geometry, we match, place, and align each object in the scene to thousands of CAD models of objects. In addition to the end-to-end system, the key technical...
Existing online 3D shape repositories contain thousands of 3D models but lack photorealistic appearance. We present an approach to automatically assign high-quality, realistic appearance models to large scale 3D shape collections. The key idea is to jointly leverage three types of online data -- shape collections, material collections, and photo co...
Given an Internet photo collection of a landmark, we compute a 3D time-lapse video sequence where a virtual camera moves continuously in time and space. While previous work assumed a static camera, the addition of camera motion during the time-lapse creates a very compelling impression of parallax. Achieving this goal, however, requires addressing...
This paper describes a simple 3D display that can be built from a tablet computer and a plastic sheet folded into a cone. This display allows viewing a three-dimensional object from any direction over a 360-degree path of travel without the use of special glasses. Inspired by the classic Pepper's Ghost illusion, our approach uses a curved transpare...
Rendering 3D-360° VR video from a camera rig is computation-intensive and typically performed offline. In this paper, we target the most time-consuming step of the VR video creation process, high-quality flow estimation with the bilateral solver. We propose a new algorithm, the hardware-friendly bilateral solver, that enables faster runtimes than e...
Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize h...
We present Jump, a practical system for capturing high resolution, omnidirectional stereo (ODS) video suitable for wide scale consumption in currently available virtual reality (VR) headsets. Our system consists of a video camera built using off-the-shelf components and a fully automatic stitching pipeline capable of capturing video content in the...
We present an interactive system to capture CAD-like 3D models of indoor scenes, on a mobile device. To overcome sensory and computational limitations of the mobile platform, we employ an in situ, semi-automated approach and harness the user's high-level knowledge of the scene to assist the reconstruction and modeling algorithms. The modeling proce...
Given a single photo of a room and a large database of furniture CAD models, our goal is to reconstruct a scene that is as similar as possible to the scene depicted in the photograph, and composed of objects drawn from the database. We present a completely automatic system to address this IM2CAD problem that produces high quality results on challen...
We introduce an approach for synthesizing time-lapse videos of popular landmarks from large community photo collections. The approach is completely automated and leverages the vast quantity of photos available online. First, we cluster 86 million photos into landmarks and popular viewpoints. Then, we sort the photos by date and warp each photo onto...
Systems, methods and articles of manufacture for generating sequences of face and expression aligned images are presented. An embodiment includes determining a plurality of candidate images, computing a similarity distance between an input image and each of the candidate images based on facial features in the input image and the candidate images, c...
Recent face recognition experiments on the LFW benchmark show that face recognition is performing stunningly well, surpassing human recognition rates. In this paper, we study face recognition at scale. Specifically, we have collected from Flickr a million faces and evaluated state of the art face recognition algorithms on this dataset. We...
System and method for rendering a sequence of orthographic approximation images corresponding to camera poses to generate an animation moving between an initial view and a final view of a target area are provided. An initial image corresponding to an initial camera pose directed at the target area is identified. A final image and an associated dept...
System and methods for generating a model of an environment are provided. In some aspects, a system includes a layer module configured to identify one or more layers of the environment based on a plurality of three-dimensional (3D) points mapping the environment. The system also includes a layout module configured to generate a layout for each laye...
System and method for rendering a sequence of images corresponding to a sequence of camera poses of a target area to generate an animation representative of a progression of camera poses are provided. An initial image and an associated initial depthmap of a target area captured from an initial camera pose, and a final image and an associated final...
We address the problem of geo-registering ground-based multi-view stereo models by ground-to-aerial image matching. The main contribution is a fully automated geo-registration pipeline with a novel viewpoint-dependent matching method that handles ground to aerial viewpoint variation. We conduct large-scale experiments which consist of many popular...
A system and machine-implemented method for providing one or more photos associated with a point of interest on a map, the method including receiving an indication of a request from a user to view photos associated with a point of interest on a map, identifying a set of photos associated with the point of interest, wherein the photos comprise at le...
We present an approach that takes a single photograph of a child as input and automatically produces a series of age-progressed outputs between 1 and 80 years of age, accounting for pose, expression, and illumination. Leveraging thousands of photos of children and adults at many ages from the Internet, we first show how to compute average image sub...
We present an approach that takes a single video of a person’s face and reconstructs a high detail 3D shape for each video frame. We target videos taken under uncontrolled and uncalibrated imaging conditions, such as YouTube videos of celebrities. At the heart of this work is a new dense 3D flow estimation method coupled with shape from shading. Un...
We address the problem of extending the field of view of a photo, an operation we call uncrop. Given a reference photograph to be uncropped, our approach selects, reprojects, and composites a subset of Internet imagery taken near the reference into a larger image around the reference using the underlying scene geometry. The proposed Markov Random Fi...
We introduce an approach for analyzing annotated maps of a site, together with Internet photos, to reconstruct large indoor spaces of famous tourist sites. While current 3D reconstruction algorithms often produce a set of disconnected components (3D pieces) for indoor scenes due to scene coverage or matching failures, we make use of a provided map...
We present an approach for generating face animations from large image collections of the same person. Such collections, which we call photobios, are remarkable in that they summarize a person's life in photos; the photos sample the appearance of a person over changes in age, pose, facial expression, hairstyle, and other variations. Yet, browsing a...
An exemplary method includes prompting a user to capture video data at a location. The location is associated with navigation directions for the user. Information representing visual orientation and positioning information associated with the captured video data is received by one or more computing devices, and a stored data model representing a 3D...
Over the past few years there has been a dramatic proliferation of digital cameras, and it has become increasingly easy to share large numbers of photographs with many other people. These trends have contributed to the availability of large databases of photographs. Effectively organizing, browsing, and visualizing such "seas" of images, as well as...
This paper leverages occluding contours (aka 'internal silhouettes') to improve the performance of multi-view stereo methods. The contributions are 1) a new technique to identify free-space regions arising from occluding contours, and 2) a new approach for incorporating the resulting free-space constraints into Poisson surface reconstruction. The p...
We describe a system for searching your personal photos using an extremely wide range of text queries, including dates and holidays ("Halloween"), named and categorical places ("Empire State Building" or "park"), events and occasions ("Radiohead concert" or "wedding"), activities ("skiing"), object categories ("whales"), attributes ("outdoors"), an...
We introduce an approach for analyzing Wikipedia and other text, together with online photos, to produce annotated 3D models of famous tourist sites. The approach is completely automated, and leverages online text and photo co-occurrences via Google Image Search. It enables a number of new interactions, which we demonstrate in a new 3D visualizatio...
The last decade has seen an explosion in the number of photographs available on the Internet. The sheer volume of interesting photos makes it a challenge to explore this space. Various Web and social media sites, along with search and indexing techniques, have been developed in response. One natural way to navigate these images in a 3D geo-located c...
We present the first large scale system for capturing and rendering relightable scene reconstructions from massive unstructured photo collections taken under different illumination conditions and viewpoints. We combine photos taken from many sources, Flickr-based ground-level imagery, oblique aerial views, and street view, to recover models that a...
We present a novel approach for the single view reconstruction (SVR) of piecewise swept scenes, exploiting the regular structure present in these man-made scenes. The parallelism of lines and its extension to curves are used as cues within a novel sequential algorithm propagating information across faces and their boundaries. Using this approach we...
A “Blur Remover” provides various techniques for constructing deblurred images from a sequence of motion-blurred images such as a video sequence of a scene. Significantly, this deblurring is accomplished without requiring specialized side information or camera setups. In fact, the Blur Remover receives sequential images, such as a typical video str...
In this paper, we present a novel smartphone application designed to easily capture, visualize and reconstruct homes, offices and other indoor scenes. Our application leverages data from smartphone sensors such as the camera, accelerometer, gyroscope and magnetometer to help model the indoor scene. The output of the system is two-fold; first, an in...
Past mosaicing approaches stitch a set of photos into a single static mosaic. We present a novel approach where we visualize a photo collection in an interactive viewer that allows the user to smoothly and seamlessly transition between a collection of local mosaics instead of a single static mosaic. Such an approach works with more general photo co...
This paper describes an effort to automatically create "tours" of thousands of the world's landmarks from geo-tagged user-contributed photos on the Internet. These photo tours take you through each site's most popular viewpoints on a tour that maximizes visual quality and traversal efficiency. This planning problem is framed as a form of the Travel...
Computing optical flow between any pair of Internet face photos is challenging for most current state of the art flow estimation methods due to differences in illumination, pose, and geometry. We show that flow estimation can be dramatically improved by leveraging a large photo collection of the same (or similar) object. In particular, consider the...
This paper introduces a schematic representation for architectural scenes together with robust algorithms for reconstruction from sparse 3D point cloud data. The schematic models architecture as a network of transport curves, approximating a floorplan, with associated profile curves, together comprising an interconnected set of swept surfaces. The...
We address the problem of reconstructing 3D face models from large unstructured photo collections, e.g., obtained by Google image search or from personal photo collections in iPhoto. This problem is extremely challenging due to the high degree of variability in pose, illumination, facial expression, non-rigid changes in face shape and reflectance o...
We present a system that can reconstruct 3D geometry from large, unorganized collections of photographs such as those found by searching for a given city (e.g., Rome) on Internet photo-sharing sites. Our system is built on a set of new, distributed computer vision algorithms for image matching and 3D reconstruction, designed to maximize parallelism...
Detailed 3D visual models of indoor spaces, from walls and floors to objects and their configurations, can provide extensive knowledge about the environments as well as rich contextual information of people living therein. Vision-based 3D modeling has only seen limited success in applications, as it faces many technical challenges that only a few e...
We present an approach for generating face animations from large image collections of the same person. Such collections, which we call photobios, sample the appearance of a person over changes in pose, facial expression, hairstyle, age, and other variations. By optimizing the order in which images are displayed and cross-dissolving between them, we...
Given a community-contributed set of photos of a crowded public event, this paper addresses the problem of finding all images of each person in the scene. This problem is very challenging due to large changes in camera viewpoints, severe occlusions, low resolution and photos from tens or hundreds of different photographers. Despite these challenges...
We present the design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems. We explore the use of multicore CPUs as well as multicore GPUs for this purpose. We show that overcoming the severe memory and bandwidth limitations o...
This paper considers the problem of computing scene depth from a stereo pair of cameras under a sequence of illumination directions. By integrating parallax and shading cues, we obtain both metric depth and fine surface details. Casting this problem into the filter flow framework [16] enables a convex formulation of the problem, and thus a globall...
We present the design and implementation of a new inexact Newton type algorithm for solving large-scale bundle adjustment problems with tens of thousands of images. We explore the use of Conjugate Gradients for calculating the Newton step and its performance as a function of some simple and computationally efficient preconditioners. We show that t...
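As a rough illustration of the idea named in the abstract above (computing a Newton step with preconditioned Conjugate Gradients), here is a minimal sketch for a small symmetric positive-definite system. It is not the authors' implementation: the function name `jacobi_pcg`, the dense list-of-lists matrix, and the choice of a Jacobi (diagonal) preconditioner are all assumptions made for this example, since the visible text does not say which preconditioners were studied.

```python
def jacobi_pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A with
    Jacobi-preconditioned Conjugate Gradients (illustrative sketch)."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # Jacobi preconditioner: the inverse of the diagonal of A (an assumption).
    inv_diag = [1.0 / A[i][i] for i in range(n)]

    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                # stop once the residual is small
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)                   # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


# Toy 2x2 system standing in for the (much larger) normal equations.
step = jacobi_pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In an inexact Newton scheme the loop would typically be stopped early, with a loose `tol`, so that each outer iteration stays cheap; the tight tolerance here only keeps the toy example easy to check.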
Given a photo of person A, we seek a photo of person B with similar pose and expression. Solving this problem enables a form of puppetry, in which one person appears to control the face of another. When deployed on a webcam-equipped computer, our approach enables a user to control another person's face in real-time. This image-retrieval-inspired ap...
There are billions of photographs on the Internet, representing an extremely large, rich, and nearly comprehensive visual record of virtually every famous place on Earth. Unfortunately, these massive community photo collections are almost completely unstructured, making it very difficult to use them for applications such as the virtual exploration...
Community photo collections like Flickr offer a rich, ever-growing record of the world around us. New computer vision techniques can use photographs from these collections to rapidly build detailed 3D models.
We present a new image morphing approach in which the output sequence is regenerated from small pieces of the two source (input) images. The approach does not require manual correspondence, and generates compelling results even when the images are of very different objects (e.g., a cloud and a face). We pose the morphing task as an optimization wit...
This paper describes a photometric stereo method designed for surfaces with spatially-varying BRDFs, including surfaces with both varying diffuse and specular properties. Our optimization-based method builds on the observation that most objects are composed of a small number of fundamental materials by constraining each pixel to be representable by...
This paper introduces an approach for enabling existing multi-view stereo methods to operate on extremely large unstructured photo collections. The main idea is to decompose the collection into a set of overlapping sets of photos that can be processed in parallel, and to merge the resulting reconstructions. This overlapping clustering problem is fo...
This paper proposes a fully automated 3D reconstruction and visualization system for architectural scenes (interiors and exteriors). The reconstruction of indoor environments from photographs is particularly challenging due to texture-poor planar surfaces such as uniformly-painted walls. Our system first uses structure-from-motion, multi-view stere...
The filter flow problem is to compute a space-variant linear filter that transforms one image into another. This framework encompasses a broad range of transformations including stereo, optical flow, lighting changes, blur, and combinations of these effects. Parametric models such as affine motion, vignetting, and radial distortion can also be mode...
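One way to write down the relation described in the abstract above is sketched below. The notation is assumed for illustration and is not quoted from the paper: I_1 and I_2 are the two images, u and v are pixel locations, and T_u is the linear filter applied at output pixel u.

```latex
% Space-variant linear filtering: each output pixel u has its own filter T_u.
% All symbols are illustrative notation, not taken from the paper.
I_2(\mathbf{u}) = \sum_{\mathbf{v}} T_{\mathbf{u}}(\mathbf{v})\, I_1(\mathbf{v})
\quad\Longleftrightarrow\quad
\mathbf{I}_2 = T\,\mathbf{I}_1
```

Under this reading, a pure displacement (as in stereo or optical flow) corresponds to each row of T holding a single entry equal to 1, while a space-invariant blur corresponds to every row holding the same kernel, shifted to the pixel in question.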
We present a system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city (e.g., Rome) on Internet photo sharing sites. Our system uses a collection of novel parallel distributed matching and reconstruction algorithms, designed to maximize parallelism at each stage...
Low-rank approximation of image collections (e.g., via PCA) is a popular tool in many areas of computer vision. Yet, surprisingly little is known justifying the observation that images of an object or scene tend to be low dimensional, beyond the special case of Lambertian scenes. This paper considers the question of how many basis images are needed...
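To make "low-rank approximation of image collections" concrete, the sketch below computes a rank-1 approximation of a small matrix whose rows are flattened images, using power iteration. It is a generic illustration under stated assumptions, not the analysis in the paper: the function name `rank1_approx` is hypothetical, no mean is subtracted (PCA would normally centre the data first), and the all-ones starting vector is assumed not to be orthogonal to the dominant singular vector.

```python
def rank1_approx(images, iters=200):
    """Rank-1 approximation of a matrix whose rows are flattened images,
    computed by power iteration on A^T A (illustrative sketch)."""
    m, n = len(images), len(images[0])
    v = [1.0] * n                      # starting guess for the top right singular vector
    for _ in range(iters):
        # One power-iteration step: v <- normalise(A^T (A v)).
        u = [sum(images[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(images[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5   # assumed non-zero for this sketch
        v = [x / norm for x in v]
    # u = A v equals the top singular value times the left singular vector.
    u = [sum(images[i][j] * v[j] for j in range(n)) for i in range(m)]
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]


# Three 3-pixel "images" that differ only by brightness, so one basis image suffices.
collection = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
approx = rank1_approx(collection)
```

Here the collection has rank exactly 1, so the approximation reproduces it up to floating-point error; for real photo collections the interesting question is how quickly the error falls as more basis images are added.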
We address the problem of automatically aligning structure-from-motion reconstructions to overhead images, such as satellite images, maps and floor plans, generated from an orthographic camera. We compute the optimal alignment using an objective function that matches 3D points to image edges and imposes free space constraints based on the visibilit...
Multi-view stereo (MVS) algorithms now produce reconstructions that rival laser range scanner accuracy. However, stereo algorithms require textured surfaces, and therefore work poorly for many architectural scenes (e.g., building interiors with textureless, painted walls). This paper presents a novel MVS approach to overcome these limitations for M...
There are billions of photographs on the Internet, comprising the largest and most diverse photo collection ever assembled. How can computer vision researchers exploit this imagery? This paper explores this question from the standpoint of 3D scene modeling and visualization. We present structure-from-motion and image-based rendering algorithms that...
We explore the use of tracked 2D object motion to enable novel approaches to interacting with video. These include moving annotations, video navigation by direct manipulation of objects, and creating an image composite from multiple video frames. Features in the video are automatically tracked and grouped in an off-line preprocess that enables late...
Given a collection of images of a static scene taken by many different people, we identify and segment interesting objects. To solve this problem, we use the distribution of images in the collection along with a new field-of-view cue, which leverages the observation that people tend to take photos that frame an object of interest within the field o...
When a scene is photographed many times by different people, the viewpoints often cluster along certain paths. These paths are largely specific to the scene being photographed, and follow interesting regions and viewpoints. We seek to discover a range of such paths and turn them into controls for image-based rendering. Our approach takes as input a...