Steven M. Seitz’s research while affiliated with the University of Washington and other places


Publications (198)


Inverse Painting: Reconstructing The Painting Process
  • Conference Paper
  • December 2024 · 2 Reads

Bowei Chen · Yifan Wang · [...] · Steven M. Seitz

Constrained Diffusion Implicit Models
  • Preprint
  • File available
  • November 2024 · 7 Reads

Figure 7: We fix the total number of inference steps at 200 and evaluate different combinations of T' and K. FID always prefers more denoising steps T', while LPIPS and PSNR are best at a mix of T' and K steps.
Figure 8: Using noisy inpainting to tackle sparse point cloud reconstruction. (a) Shows a sparse point cloud projected to a desired camera angle. (b) Shows the result after our method is used for noisy inpainting.
Figure 10: FFHQ Super-resolution extended results.

This paper describes an efficient algorithm for solving noisy linear inverse problems using pretrained diffusion models. Extending the paradigm of denoising diffusion implicit models (DDIM), we propose constrained diffusion implicit models (CDIM) that modify the diffusion updates to enforce a constraint upon the final output. For noiseless inverse problems, CDIM exactly satisfies the constraints; in the noisy case, we generalize CDIM to satisfy an exact constraint on the residual distribution of the noise. Experiments across a variety of tasks and metrics show strong performance of CDIM, with analogous inference acceleration to unconstrained DDIM: 10 to 50 times faster than previous conditional diffusion methods. We demonstrate the versatility of our approach on many problems including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reconstruction.
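A rough, hedged sketch of this idea (not the authors' released implementation): a constrained DDIM-style sampler can alternate the usual DDIM update with a gradient step that pushes the current clean-image estimate toward satisfying the measurement y = A(x). The names denoiser, A, and alphas_cumprod below are assumed placeholders.

import torch

def cdim_like_sample(denoiser, A, y, alphas_cumprod, shape, num_steps=50, step_size=1.0):
    # Illustrative constrained DDIM-style loop; a sketch, not the paper's code.
    # denoiser(x_t, t) -> predicted noise; A is a differentiable measurement operator.
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, num_steps).long()
    x = torch.randn(shape, device=y.device)
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else alphas_cumprod.new_tensor(1.0)
        with torch.no_grad():
            eps = denoiser(x, t)                               # predicted noise
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # DDIM estimate of the clean image

        # Constraint step: nudge x0_hat toward the observation y = A(x0).
        x0_hat = x0_hat.detach().requires_grad_(True)
        residual = (A(x0_hat) - y).pow(2).sum()
        grad, = torch.autograd.grad(residual, x0_hat)
        x0_hat = (x0_hat - step_size * grad).detach()

        # Deterministic DDIM update toward the previous timestep.
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return x

In the noiseless case one could instead project x0_hat exactly onto the constraint set; the single gradient step above is only the simplest stand-in.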


Inverse Painting: Reconstructing The Painting Process
  • September 2024 · 28 Reads

Given an input painting, we reconstruct a time-lapse video of how it may have been painted. We formulate this as an autoregressive image generation problem, in which an initially blank "canvas" is iteratively updated. The model learns from real artists by training on many painting videos. Our approach incorporates text and region understanding to define a set of painting "instructions" and updates the canvas with a novel diffusion-based renderer. The method extrapolates beyond the limited, acrylic style paintings on which it has been trained, showing plausible results for a wide range of artistic styles and genres.
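As a hedged sketch of the autoregressive formulation described above (not the authors' pipeline), the canvas-update loop can be written as follows; predict_instruction and render_update are hypothetical stand-ins for the paper's instruction-generation and diffusion-based rendering modules.

import numpy as np

def reconstruct_painting_process(target, predict_instruction, render_update, num_steps=20):
    # Illustrative autoregressive loop: start from a blank canvas and repeatedly
    # choose a text/region "instruction", then render the updated canvas with it.
    canvas = np.ones_like(target)                              # blank (white) canvas
    frames = [canvas.copy()]
    for _ in range(num_steps):
        instruction = predict_instruction(canvas, target)      # e.g. region mask plus text
        canvas = render_update(canvas, instruction, target)    # diffusion-based canvas update
        frames.append(canvas.copy())
    return frames                                              # time-lapse ending near the target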


Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
  • August 2024 · 22 Reads

Figure 1. Method overview. In the lightweight backward-motion fine-tuning stage, an input video x = {I_0, I_1, ..., I_{N−1}} is encoded into the latent space by E(x), and noise is added to create the noisy latent z_t; during inference, z_t is created by iterative denoising starting from z_T ∼ N(0, I). (1) Forward motion prediction: we take the conditioning c_0 of the first input image (inference stage) or the first frame of the video (training stage) I_0, along with the noisy latent z_t, and feed them into the pre-trained 3D UNet f_θ to obtain the noise prediction v̂_{t,0} as well as the temporal self-attention maps {A_i}. (2) Backward motion prediction: we reverse the noisy latent z_t along the temporal axis to get z′_t. We then take the conditioning c_{N−1} of the second input image, or the last frame of the video I_{N−1}, along with the 180-degree rotated temporal self-attention maps {A′_i}, and feed them through the fine-tuned 3D UNet f_{θ′} to obtain the backward motion prediction v̂_{t,1}. (3) Fuse and update: the predicted backward-motion noise is reversed again and fused with the forward-motion noise to create a consistent motion path. Note that only the value and output projection matrices W_{v,o} in the temporal self-attention layers are fine-tuned; see Fig. 2 for more details.

We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
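A minimal sketch of the dual-directional sampling idea from this abstract, assuming a video latent shaped (batch, frames, ...) and hypothetical forward_model/backward_model callables; this is an illustration, not the authors' code.

import torch

def dual_directional_estimate(z_t, t, forward_model, backward_model, c_first, c_last):
    # Forward-in-time prediction conditioned on the first key frame.
    v_fwd = forward_model(z_t, t, c_first)
    # Backward prediction: flip the latent along the frame axis, condition on the
    # last key frame, then flip the prediction back to the forward time axis.
    v_bwd = torch.flip(backward_model(torch.flip(z_t, dims=[1]), t, c_last), dims=[1])
    # Fuse the two overlapping estimates (a simple average here) for the sampling update.
    return 0.5 * (v_fwd + v_bwd)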








Citations (72)


... The synthesis of high-resolution images presents a formidable challenge due to the intrinsic complexities of learning from high-dimensional data and the substantial computational resources necessary to extend image generation beyond the trained resolution. Most recently, some training-free approaches (Bar-Tal et al. 2023; Si et al. 2024; Du et al. 2024; Zhang et al. 2023b; Guo et al. 2024; Yang et al. 2024; Wang et al. 2024; Jin et al. 2024) adjust inference strategies or network architectures for higher-resolution generation, adding sufficient detail to produce high-quality, high-resolution results. Scale-Crafter (He et al. 2023) proposes a re-dilation strategy for dynamically increasing the receptive field in the diffusion UNet (Ronneberger, Fischer, and Brox 2015). ...

Reference:

HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
Generative Powers of Ten
  • Citing Conference Paper
  • June 2024
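For orientation only, the re-dilation idea mentioned in the snippet above (enlarging a diffusion UNet's receptive field at inference time) can be sketched generically in PyTorch; this is not Scale-Crafter's actual implementation.

import torch.nn as nn

def redilate_convs(unet, factor=2):
    # Generic sketch: enlarge the dilation (and padding, to keep spatial size) of
    # 3x3 convolutions so their receptive field grows for higher-resolution inputs.
    for module in unet.modules():
        if isinstance(module, nn.Conv2d) and module.kernel_size == (3, 3):
            d = module.dilation[0] * factor
            module.dilation = (d, d)
            module.padding = (d, d)   # preserves output size for stride-1 3x3 convs
    return unet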

... As an alternative approach to address the challenges associated with specialised equipment and controlled environments, the authors of [84] introduce a method for deriving personalised HRTFs using binaural recordings and head tracking data from consumer devices like earbuds with microphones and inertial measurement units (IMUs). By analysing how sound changes with head movement in various environments, the method estimates personalised HRTFs. ...

HRTF Estimation in the Wild
  • Citing Conference Paper
  • October 2023

... Sparse-view Pose Estimation. Traditional correspondence-based Structure-from-Motion [32,29] methods often fail to estimate camera poses in sparse-view settings. Several approaches instead seek to leverage data-driven priors, for example learning energy-based [48,18] or denoising diffusion [39] models to predict cameras. ...

Photo tourism: exploring photo collections in 3D
  • Citing Chapter
  • August 2023

... The models were trained with two different loss functions. The first was a loss function based on the negative SNR, defined to be 1 if there are no speakers inside the bubble, and −10 log( ∥s∥_2^2 / ∥ŝ − s∥_2^2 ) otherwise (7). Here s is the target signal, ŝ is the network output signal, ∥·∥_1 is the L1-norm (equivalently, the sum of element-wise absolute differences), and λ = 50 is a weighting factor. ...

ClearBuds: wireless binaural earbuds for learning-based speech enhancement
  • Citing Conference Paper
  • June 2022
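The piecewise loss quoted in the snippet above (Eq. 7) translates directly into code; the sketch below follows that definition, assumes the usual dB convention of log base 10, and is not the cited system's training code.

import torch

def negative_snr_loss(s_hat, s, speaker_in_bubble):
    # Eq. (7) as quoted: 1 when no speaker is inside the bubble, otherwise the
    # negative SNR, -10 * log10(||s||^2 / ||s_hat - s||^2).
    if not speaker_in_bubble:
        return torch.tensor(1.0)
    signal = s.pow(2).sum()
    error = (s_hat - s).pow(2).sum()
    return -10.0 * torch.log10(signal / (error + 1e-12))      # small eps for numerical safety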

... D-NeRF [43] and Dg-mesh [29]). Evaluations on other monocular video datasets, such as Nerfies [41], are included in the Supplementary material. D-NeRF includes eight sets of dynamic scenes featuring complex motion, such as articulated objects and human actions. ...

Nerfies: Deformable Neural Radiance Fields
  • Citing Conference Paper
  • October 2021

... D-NeRF (Pumarola et al. (2020)) shows synthetic objects captured by 360-orbit inward-facing cameras against a white background (8 scenes). Nerfies (4 scenes) and HyperNeRF (Park et al. (2021b)) (17 scenes) data contain general real-world scenes of kitchen table top actions, human faces, and outdoor animals. NeRF-DS contains many reflective surfaces in motion, such as silver jugs or glazed ceramic plates held by human hands in indoor tabletop scenes (7 scenes). ...

HyperNeRF: a higher-dimensional representation for topologically varying neural radiance fields
  • Citing Article
  • December 2021

ACM Transactions on Graphics

... Free-Viewpoint Video (FVV) synthesis from sparse input views is a challenging and crucial task in computer vision, widely used in sports broadcasting, stage performance, and telepresence systems [1], [2]. However, early attempts [3], [4] tried to solve this problem through a weighted blending mechanism [5] using a huge number of cameras, which dramatically increases computational cost and latency. ...

Project Starline: A high-fidelity telepresence system
  • Citing Article
  • December 2021

ACM Transactions on Graphics

... Let us define these tasks as follows: Reconstruction. We maintain a held-out test set for each cluster, X_t^test, whose test examples are drawn from D_t. We evaluate how faithful our personalized prior is through the commonly used projection-based approach [24,25,28,33] of finding the best latent code in the personalized latent space that reconstructs the test image. This is done by freezing the generator and optimizing over the W+ latent space. ...

Time-travel rephotography
  • Citing Article
  • December 2021

ACM Transactions on Graphics
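The projection-based evaluation described in the snippet above (freeze the generator, optimize a W+ latent code to reconstruct a held-out image) can be sketched as a plain optimization loop; the generator(w_plus) interface, the zero initialization, and the pixel-only loss are simplifying assumptions.

import torch

def project_to_w_plus(generator, target, num_layers=18, w_dim=512, steps=500, lr=0.05):
    # Freeze the generator; only the W+ code (one w vector per synthesis layer) is optimized.
    for p in generator.parameters():
        p.requires_grad_(False)
    w_plus = torch.zeros(1, num_layers, w_dim, requires_grad=True)
    opt = torch.optim.Adam([w_plus], lr=lr)
    for _ in range(steps):
        recon = generator(w_plus)                 # assumed interface: image synthesized from W+
        loss = (recon - target).pow(2).mean()     # pixel loss; LPIPS is often added in practice
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_plus.detach()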

... On the technical side, the solution to this problem aligns with the trajectory of image-and-text-to-video (IT2V) works [7,16,17,20,22,26,30,32,43,55,57,65,66,80,82,91,93], since they have the same input (single image and text) and output (video) modalities as our problem setting. However, there are critical differences between IT2V and instructional video generation (IVG) that, despite the advances in IT2V, make IVG a challenge. ...

Animating Pictures with Eulerian Motion Fields
  • Citing Conference Paper
  • June 2021