December 2024 · 2 Reads
November 2024 · 7 Reads
This paper describes an efficient algorithm for solving noisy linear inverse problems using pretrained diffusion models. Extending the paradigm of denoising diffusion implicit models (DDIM), we propose constrained diffusion implicit models (CDIM) that modify the diffusion updates to enforce a constraint upon the final output. For noiseless inverse problems, CDIM exactly satisfies the constraints; in the noisy case, we generalize CDIM to satisfy an exact constraint on the residual distribution of the noise. Experiments across a variety of tasks and metrics show strong performance of CDIM, with analogous inference acceleration to unconstrained DDIM: 10 to 50 times faster than previous conditional diffusion methods. We demonstrate the versatility of our approach on many problems including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reconstruction.
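A minimal sketch of the kind of constrained update described above, assuming a pretrained noise predictor and a known linear measurement operator; the names eps_model, A, y, and alphas_cumprod are placeholders, and the gradient-based constraint step is a generic illustration rather than the paper's exact rule:

```python
import torch

def cdim_style_step(x_t, t, t_prev, eps_model, alphas_cumprod, A, y, step_size=1.0):
    """One deterministic DDIM-style transition followed by a constraint step.

    Generic sketch: estimate the clean image x0 from the predicted noise,
    take the usual DDIM (eta = 0) step, then nudge the iterate so that the
    x0 estimate better satisfies the measurement constraint A(x0) ~= y.
    """
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]

    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)                               # predicted noise
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # clean-image estimate

    # Deterministic DDIM transition toward timestep t_prev.
    x_prev = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps

    # Constraint step: gradient of the squared residual w.r.t. the iterate.
    residual = (A(x0_hat) - y).pow(2).sum()
    grad = torch.autograd.grad(residual, x_t)[0]
    return (x_prev - step_size * grad).detach()
```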
September 2024 · 28 Reads
Given an input painting, we reconstruct a time-lapse video of how it may have been painted. We formulate this as an autoregressive image generation problem, in which an initially blank "canvas" is iteratively updated. The model learns from real artists by training on many painting videos. Our approach incorporates text and region understanding to define a set of painting "instructions" and updates the canvas with a novel diffusion-based renderer. The method extrapolates beyond the limited acrylic-style paintings on which it has been trained, showing plausible results for a wide range of artistic styles and genres.
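At a high level, the autoregressive formulation can be pictured as a simple loop; plan_instruction and render_stroke below are hypothetical stand-ins for the paper's instruction model and diffusion-based renderer:

```python
import torch

def paint_timelapse(target, plan_instruction, render_stroke, num_steps=50):
    """Sketch of the autoregressive canvas-update loop described above.

    plan_instruction: proposes the next painting "instruction" (e.g. a text
                      and region pair) given the current canvas and target.
    render_stroke:    applies the instruction to the canvas (a diffusion-
                      based renderer in the paper). Both are placeholders.
    """
    canvas = torch.ones_like(target)      # start from a blank white canvas
    frames = [canvas]
    for _ in range(num_steps):
        instruction = plan_instruction(canvas, target)
        canvas = render_stroke(canvas, instruction)
        frames.append(canvas)
    return frames                         # the reconstructed time-lapse
```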
August 2024 · 22 Reads
We present a method for generating video sequences with coherent motion between a pair of input keyframes. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for keyframe interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
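One way to picture the dual-directional sampling step: at each denoising iteration, the forward-time model (conditioned on the first keyframe) and the fine-tuned backward-time model (conditioned on the last keyframe, applied to the time-reversed clip) each predict the noise, and the two estimates are fused. The signatures below are illustrative, and the simple average stands in for the paper's combination rule:

```python
import torch

def fuse_bidirectional_eps(x_t, t, eps_fwd, eps_bwd, frame0, frame1):
    """Fuse noise estimates from a forward- and a backward-time video model.

    x_t: noisy video latent with time as the leading dimension (T, C, H, W).
    The backward model sees the clip reversed in time, and its prediction is
    flipped back to forward order before averaging.
    """
    e_fwd = eps_fwd(x_t, t, cond=frame0)                  # forward from keyframe 0
    e_bwd = eps_bwd(torch.flip(x_t, dims=[0]), t, cond=frame1)
    e_bwd = torch.flip(e_bwd, dims=[0])                   # back to forward order
    return 0.5 * (e_fwd + e_bwd)                          # combined estimate
```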
June 2024 · 3 Reads · 1 Citation
June 2024 · 1 Citation
October 2023 · 9 Reads · 3 Citations
August 2023 · 44 Reads · 54 Citations
June 2022 · 11 Reads · 11 Citations
June 2022 · 20 Reads · 34 Citations
... The synthesis of high-resolution images presents a formidable challenge due to the intrinsic complexities of learning from high-dimensional data and the substantial computational resources necessary to extend image generation beyond the trained resolution. Most recently, some training-free approaches (Bar-Tal et al. 2023; Si et al. 2024; Du et al. 2024; Zhang et al. 2023b; Guo et al. 2024; Yang et al. 2024; Wang et al. 2024; Jin et al. 2024) adjust inference strategies or network architectures so that higher-resolution generation adds sufficient detail to produce high-quality results. ScaleCrafter (He et al. 2023) proposes a re-dilation strategy for dynamically increasing the receptive field in the diffusion UNet (Ronneberger, Fischer, and Brox 2015). ...
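The re-dilation idea can be sketched in a few lines: at inference time, enlarge the dilation (and matching padding) of the UNet's 3x3 convolutions so the receptive field grows with the target resolution. This is a simplified illustration; ScaleCrafter applies the change selectively and dynamically rather than globally as here:

```python
import torch.nn as nn

def redilate_unet(unet: nn.Module, factor: int = 2) -> nn.Module:
    """Inference-time re-dilation sketch: widen the receptive field of every
    3x3 convolution without retraining, in the spirit of ScaleCrafter.
    """
    for module in unet.modules():
        if isinstance(module, nn.Conv2d) and module.kernel_size == (3, 3):
            module.dilation = (factor, factor)
            # Matching padding keeps spatial size unchanged for stride-1 convs.
            module.padding = (factor, factor)
    return unet
```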
June 2024
... As an alternative approach to address the challenges associated with specialised equipment and controlled environments, the authors of [84] introduce a method for deriving personalised HRTFs using binaural recordings and head tracking data from consumer devices like earbuds with microphones and inertial measurement units (IMUs). By analysing how sound changes with head movement in various environments, the method estimates personalised HRTFs. ...
October 2023
... Sparse-view Pose Estimation. Traditional correspondence-based Structure-from-Motion [32,29] methods often fail to estimate camera poses in sparse-view settings. Several approaches instead seek to leverage data-driven priors, for example learning energy-based [48,18] or denoising diffusion [39] models to predict cameras. ...
August 2023
... The models were trained with two different loss functions. The first was a loss function based on the negative SNR, defined as follows:

    1                                  if there are no speakers inside the bubble
    −10 log( ∥s∥₂² / ∥ŝ − s∥₂² )       otherwise        (7)

Here s is the target signal, ŝ is the network output signal, ∥·∥₁ is the L1-norm (equivalently, the sum of element-wise absolute differences), and λ = 50 is a weighting factor. ...
Reference: Hearable devices with sound bubbles
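Read directly off Eq. (7), the negative-SNR term could be implemented as below (a sketch, not the authors' reference code; the base-10 log follows the usual dB convention, and the eps floor is added here for numerical safety):

```python
import torch

def neg_snr_loss(s_hat: torch.Tensor, s: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative-SNR loss from Eq. (7): returns 1 when there is no target
    speech inside the bubble, else -10 * log10(||s||^2 / ||s_hat - s||^2).
    """
    target_energy = s.pow(2).sum()
    if target_energy < eps:                       # no speakers inside the bubble
        return torch.tensor(1.0, device=s.device)
    error_energy = (s_hat - s).pow(2).sum().clamp_min(eps)
    return -10.0 * torch.log10(target_energy / error_energy)
```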
June 2022
... D-NeRF [43] and Dg-mesh [29]). Evaluations on other monocular video datasets, such as Nerfies [41], are included in the Supplementary material. D-NeRF includes eight sets of dynamic scenes featuring complex motion, such as articulated objects and human actions. ...
October 2021
... D-NeRF (Pumarola et al., 2020) shows synthetic objects captured by 360°-orbit inward-facing cameras against a white background (8 scenes). The Nerfies (4 scenes) and HyperNeRF (Park et al., 2021b) (17 scenes) data contain general real-world scenes of kitchen tabletop actions, human faces, and outdoor animals. NeRF-DS contains many reflective surfaces in motion, such as silver jugs or glazed ceramic plates held by human hands in indoor tabletop scenes (7 scenes). ...
December 2021
ACM Transactions on Graphics
... Free-Viewpoint Video (FVV) synthesis from sparse input views is a challenging and crucial task in computer vision, widely used in sports broadcasting, stage performance, and telepresence systems [1], [2]. However, early attempts [3], [4] tried to solve this problem through a weighted blending mechanism [5] using a large number of cameras, which dramatically increases computational cost and latency. ...
December 2021
ACM Transactions on Graphics
... Let us define these tasks as follows: Reconstruction. We maintain a held-out test set for each cluster, X_t^test, whose test examples are drawn from D_t. We evaluate how faithful our personalized prior is through the commonly-used projection-based approach [24,25,28,33] of finding the best latent code in the personalized latent space that reconstructs the test image. This is done by freezing the generator and optimizing over the W+ latent space. ...
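A minimal sketch of that projection step, assuming a frozen StyleGAN-style generator with a synthesis(w_plus) interface and an LPIPS perceptual loss (both interfaces are illustrative):

```python
import torch

def project_to_w_plus(generator, target, lpips_loss, w_init, steps=500, lr=0.05):
    """Projection-based reconstruction: freeze the generator and optimize a
    W+ latent code so the synthesized image matches the held-out test image.

    generator.synthesis and w_init are illustrative; w_init would typically
    be the generator's mean latent, broadcast to W+ shape.
    """
    for p in generator.parameters():
        p.requires_grad_(False)                  # freeze the generator

    w_plus = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w_plus], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        image = generator.synthesis(w_plus)
        # Perceptual + pixel reconstruction objective, as in common
        # projection-based inversion setups.
        loss = lpips_loss(image, target) + (image - target).pow(2).mean()
        loss.backward()
        opt.step()
    return w_plus.detach()
```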
December 2021
ACM Transactions on Graphics
... On the technical side, the solution to this problem aligns with the trajectory of image-and-text-to-video (IT2V) works [7,16,17,20,22,26,30,32,43,55,57,65,66,80,82,91,93], since they have the same input (single image and text) and output (video) modalities as our problem setting. However, there are critical differences between IT2V and instructional video generation (IVG) that, despite the advances in IT2V, make IVG a challenge. ...
Reference: Instructional Video Generation
June 2021
... 3D scene manipulation and inpainting. Early works [63,82,41,56,1,29,79,80,78,89] explored street-scene editing by leveraging single-view or multi-view image inpainting networks. With the rapid development of neural scene representations, 3D scene editing has been explored in many works [10,88,75,81,2,23,19,40]. ...
June 2021