Ayush Tewari’s research while affiliated with Massachusetts Institute of Technology and other places

Publications (79)


Manifold Sampling for Differentiable Uncertainty in Radiance Fields
  • Conference Paper

December 2024 · 2 Reads

Linjie Lyu · Ayush Tewari · [...] · Christian Theobalt

Fig. 3. Different variants to model an uncertainty volume V in the space of radiance field parameters (top row, only three out of millions of parameters are shown) using different covariance matrices Σ (bottom row, 20 dimensions are shown). (a) A full Σ is the most expressive solution that leads to an arbitrarily shaped parallelotope, but it suffers from an intractable number of parameters. (b) Restricting Σ to a diagonal matrix is a sparse solution, but it can only represent axis-aligned hyper-rectangles. (c) A block-diagonal Σ is slightly more expressive, but it requires making representation-specific independence assumptions and small blocks to stay tractable. (d) Our solution employs a low-rank covariance matrix, which results in a manifold parallelotope (here a 2D parallelogram). This parameterization is highly efficient to train and results in expressive uncertainty estimates.
Numerical evaluation for novel-view synthesis with active camera selection on the NeRF Synthetic dataset [Mildenhall et al. 2020].
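As a rough, hedged illustration of the tractability argument in the caption above, the sketch below compares parameter counts for the four covariance parameterizations. The field size n, rank k, and block size b are assumed values chosen for illustration, not configurations from the paper.

```python
# Rough parameter-count comparison for the covariance variants described in
# the figure caption above. All sizes below are illustrative assumptions.
n = 5_000_000  # number of radiance field parameters (assumed)
k = 32         # rank of the low-rank covariance factor (assumed)
b = 64         # block size for a block-diagonal covariance (assumed)

full_cov       = n * (n + 1) // 2               # (a) full symmetric covariance
diagonal_cov   = n                              # (b) one variance per parameter
block_diag_cov = (n // b) * (b * (b + 1) // 2)  # (c) independent b-by-b blocks
low_rank_cov   = n * k                          # (d) low-rank factor U of shape (n, k)

for name, count in [("full", full_cov), ("diagonal", diagonal_cov),
                    ("block-diagonal", block_diag_cov), ("low-rank", low_rank_cov)]:
    print(f"{name:>15}: {count:,} parameters")
```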
Manifold Sampling for Differentiable Uncertainty in Radiance Fields
  • Preprint
  • File available

September 2024 · 18 Reads

Radiance fields are powerful and, hence, popular models for representing the appearance of complex scenes. Yet, constructing them based on image observations gives rise to ambiguities and uncertainties. We propose a versatile approach for learning Gaussian radiance fields with explicit and fine-grained uncertainty estimates that impose only little additional cost compared to uncertainty-agnostic training. Our key observation is that uncertainties can be modeled as a low-dimensional manifold in the space of radiance field parameters that is highly amenable to Monte Carlo sampling. Importantly, our uncertainties are differentiable and, thus, allow for gradient-based optimization of subsequent captures that optimally reduce ambiguities. We demonstrate state-of-the-art performance on next-best-view planning tasks, including high-dimensional illumination planning for optimal radiance field relighting quality.
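A minimal sketch of the kind of low-rank Monte Carlo sampling the abstract describes, assuming a reparameterization θ = μ + Uz with z ~ N(0, I); the tensor shapes, variable names, and placeholder objective are illustrative assumptions, not the authors' implementation.

```python
import torch

def sample_parameters(mu: torch.Tensor, U: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Draw Monte Carlo samples theta = mu + U @ z with z ~ N(0, I).

    mu: (n,) mean radiance-field parameters; U: (n, k) low-rank factor, so the
    samples lie on a k-dimensional manifold in the n-dimensional parameter space.
    """
    z = torch.randn(num_samples, U.shape[1])   # (num_samples, k)
    return mu.unsqueeze(0) + z @ U.T           # (num_samples, n)

# Because samples are a differentiable function of mu and U, any loss computed on
# renderings of sampled parameters can be backpropagated to refine the uncertainty,
# e.g. for gradient-based next-best-view selection.
mu = torch.zeros(10_000, requires_grad=True)            # illustrative field size
U = 0.01 * torch.randn(10_000, 32, requires_grad=True)  # assumed rank-32 factor
samples = sample_parameters(mu, U, num_samples=8)
loss = samples.pow(2).mean()                            # placeholder objective
loss.backward()
```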




Diffusion Posterior Illumination for Ambiguity-Aware Inverse Rendering

December 2023 · 15 Reads · 17 Citations

ACM Transactions on Graphics

Inverse rendering, the process of inferring scene properties from images, is a challenging inverse problem. The task is ill-posed, as many different scene configurations can give rise to the same image. Most existing solutions incorporate priors into the inverse-rendering pipeline to encourage plausible solutions, but they do not consider the inherent ambiguities and the multi-modal distribution of possible decompositions. In this work, we propose a novel scheme that integrates a denoising diffusion probabilistic model pre-trained on natural illumination maps into an optimization framework involving a differentiable path tracer. The proposed method allows sampling from combinations of illumination and spatially-varying surface materials that are, both, natural and explain the image observations. We further conduct an extensive comparative study of different priors on illumination used in previous work on inverse rendering. Our method excels in recovering materials and producing highly realistic and diverse environment map samples that faithfully explain the illumination of the input images.
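The schematic sketch below shows how a pre-trained diffusion prior over environment maps can be combined with a differentiable renderer during sampling, in the spirit of the approach described above; the guidance rule, function names, and toy stand-ins are assumptions rather than the paper's exact algorithm.

```python
import torch

def guided_denoising_step(env_t, t, denoiser, render, target_image, guidance_scale=1.0):
    """One reverse-diffusion step on a noisy environment map, nudged toward
    explaining the observed image (schematic; not the paper's exact update).

    denoiser: pre-trained diffusion prior predicting the denoised environment map
    render:   differentiable renderer producing an image from an environment map
    """
    env_t = env_t.detach().requires_grad_(True)
    env0_hat = denoiser(env_t, t)                     # prior's estimate of the clean map
    recon_loss = (render(env0_hat) - target_image).pow(2).mean()
    grad, = torch.autograd.grad(recon_loss, env_t)    # likelihood gradient
    # Follow the prior, then take a small step that improves the rendering fit.
    return (env0_hat - guidance_scale * grad).detach()

# Illustrative usage with toy stand-ins for the prior and the renderer.
denoiser = lambda x, t: 0.9 * x
render = lambda env: env.mean(dim=0)
env = torch.randn(8, 16, 16)                          # toy environment-map tensor
env = guided_denoising_step(env, t=10, denoiser=denoiser,
                            render=render, target_image=torch.zeros(16, 16))
```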


AvatarStudio: Text-Driven Editing of 3D Dynamic Human Head Avatars

December 2023 · 21 Reads · 20 Citations

ACM Transactions on Graphics

Capturing and editing full-head performances enables the creation of virtual characters with various applications such as extended reality and media production. The past few years witnessed a steep rise in the photorealism of human head avatars. Such avatars can be controlled through different input data modalities, including RGB, audio, depth, IMUs, and others. While these data modalities provide effective means of control, they mostly focus on editing the head movements such as the facial expressions, head pose, and/or camera viewpoint. In this paper, we propose AvatarStudio, a text-based method for editing the appearance of a dynamic full head avatar. Our approach builds on existing work to capture dynamic performances of human heads using Neural Radiance Field (NeRF) and edits this representation with a text-to-image diffusion model. Specifically, we introduce an optimization strategy for incorporating multiple keyframes representing different camera viewpoints and time stamps of a video performance into a single diffusion model. Using this personalized diffusion model, we edit the dynamic NeRF by introducing view-and-time-aware Score Distillation Sampling (VT-SDS) following a model-based guidance approach. Our method edits the full head in a canonical space and then propagates these edits to the remaining time steps via a pre-trained deformation network. We evaluate our method visually and numerically via a user study, and results show that our method outperforms existing approaches. Our experiments validate the design choices of our method and highlight that our edits are genuine, personalized, as well as 3D- and time-consistent.
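Below is a hedged sketch of a view- and time-conditioned score-distillation update in the spirit of the VT-SDS mentioned above; the conditioning interface, noise schedule, and function names are assumptions, not the authors' implementation.

```python
import torch

def vt_sds_loss(render_fn, noise_predictor, avatar_params, camera, time_step,
                prompt_emb, num_steps=1000):
    """Score-distillation loss with view- and time-aware conditioning (a sketch).

    render_fn:       renders the dynamic avatar for a given camera and time step
    noise_predictor: personalized diffusion model predicting noise, conditioned on the
                     text prompt as well as the camera viewpoint and time stamp
    """
    image = render_fn(avatar_params, camera, time_step)
    t = torch.randint(1, num_steps, (1,))
    alpha = 1.0 - t.float() / num_steps                    # simplified noise schedule
    noise = torch.randn_like(image)
    noisy = alpha.sqrt() * image + (1.0 - alpha).sqrt() * noise
    with torch.no_grad():
        eps_pred = noise_predictor(noisy, t, prompt_emb, camera, time_step)
    # Standard SDS surrogate: its gradient w.r.t. the avatar parameters is
    # (eps_pred - noise) pushed back through the differentiable renderer.
    return ((eps_pred - noise) * image).sum()
```

Edits computed this way in a canonical space could then be propagated to the remaining time steps by a deformation network, as the abstract describes.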


A Deeper Analysis of Volumetric Relightable Faces

October 2023 · 89 Reads · 3 Citations

International Journal of Computer Vision

Portrait viewpoint and illumination editing is an important problem with several applications in VR/AR, movies, and photography. Comprehensive knowledge of geometry and illumination is critical for obtaining photorealistic results. Current methods are unable to explicitly model in 3D while handling both viewpoint and illumination editing from a single image. In this paper, we propose VoRF, a novel approach that can take even a single portrait image as input and relight human heads under novel illuminations that can be viewed from arbitrary viewpoints. VoRF represents a human head as a continuous volumetric field and learns a prior model of human heads using a coordinate-based MLP with individual latent spaces for identity and illumination. The prior model is learned in an auto-decoder manner over a diverse class of head shapes and appearances, allowing VoRF to generalize to novel test identities from a single input image. Additionally, VoRF has a reflectance MLP that uses the intermediate features of the prior model for rendering One-Light-at-A-Time (OLAT) images under novel views. We synthesize novel illuminations by combining these OLAT images with target environment maps. Qualitative and quantitative evaluations demonstrate the effectiveness of VoRF for relighting and novel view synthesis, even when applied to unseen subjects under uncontrolled illumination. This work is an extension of Rao et al. (VoRF: Volumetric Relightable Faces 2022). We provide extensive evaluation and ablative studies of our model and also provide an application, where any face can be relighted using textual input.
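A small sketch of the OLAT-based relighting step mentioned in the abstract: by linearity of light transport, the relit image is an environment-map-weighted sum of one-light-at-a-time renders. The array shapes and the sampling convention below are illustrative assumptions.

```python
import numpy as np

def relight_from_olat(olat_images: np.ndarray, env_weights: np.ndarray) -> np.ndarray:
    """Combine OLAT renders into a relit image (illustrative sketch).

    olat_images: (L, H, W, 3) one rendered image per light direction
    env_weights: (L, 3) RGB intensity of the target environment map, sampled at
                 the same L light directions (assumed convention)
    """
    # Linear light transport: weighted sum of OLAT images over the L lights.
    return np.einsum('lhwc,lc->hwc', olat_images, env_weights)

# Illustrative usage with random data.
olat = np.random.rand(16, 64, 64, 3)
env = np.random.rand(16, 3)
relit = relight_from_olat(olat, env)
```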


Approaching human 3D shape perception with neurally mappable models

August 2023 · 70 Reads

Humans effortlessly infer the 3D shape of objects. What computations underlie this ability? Although various computational models have been proposed, none of them capture the human ability to match object shape across viewpoints. Here, we ask whether and how this gap might be closed. We begin with a relatively novel class of computational models, 3D neural fields, which encapsulate the basic principles of classic analysis-by-synthesis in a deep neural network (DNN). First, we find that a 3D Light Field Network (3D-LFN) supports 3D matching judgments well aligned to humans for within-category comparisons, adversarially-defined comparisons that accentuate the 3D failure cases of standard DNN models, and adversarially-defined comparisons for algorithmically generated shapes with no category structure. We then investigate the source of the 3D-LFN's ability to achieve human-aligned performance through a series of computational experiments. Exposure to multiple viewpoints of objects during training and a multi-view learning objective are the primary factors behind model-human alignment; even conventional DNN architectures come much closer to human behavior when trained with multi-view objectives. Finally, we find that while the models trained with multi-view learning objectives are able to partially generalize to new object categories, they fall short of human alignment. This work provides a foundation for understanding human shape inferences within neurally mappable computational architectures and highlights important questions for future work.
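As a schematic illustration of the cross-viewpoint matching task studied above, the sketch below phrases it as nearest-neighbor retrieval in an embedding space; the toy encoder stands in for a model such as a 3D light field network and is not the study's architecture.

```python
import torch
import torch.nn.functional as F

def match_across_viewpoints(encoder, probe_view, candidate_views):
    """Pick the candidate object that matches the probe seen from another viewpoint.

    encoder:         maps images to latent shape embeddings (placeholder model)
    probe_view:      (C, H, W) image of the target object
    candidate_views: (N, C, H, W) images of candidate objects from other viewpoints
    """
    probe_emb = F.normalize(encoder(probe_view.unsqueeze(0)), dim=-1)  # (1, D)
    cand_emb = F.normalize(encoder(candidate_views), dim=-1)           # (N, D)
    similarity = cand_emb @ probe_emb.T                                # cosine similarity
    return similarity.argmax().item()

# Illustrative usage with a toy encoder (flatten + linear projection).
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
probe = torch.randn(3, 32, 32)
candidates = torch.randn(5, 3, 32, 32)
best_match = match_across_viewpoints(encoder, probe, candidates)
```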




Citations (41)


... Implicit Shape Representations. Implicit shape representations are state-of-the-art in encoding shape geometric details [28,35,33,3,31,44,41]. To improve the shape modeling capability, researchers inject local-aware designs. ...

Reference:

CoFie: Learning Compact Neural Surface Representations with Coordinate Fields
FIRe: Fast Inverse Rendering using Directional and Signed Distance Functions
  • Citing Conference Paper
  • January 2024

... Facial makeup is an important aspect of human appearance. In computer vision and graphics, mainstream research focuses on makeup transfer [7-9, 13, 17, 18, 22, 25, 26, 28, 32, 42, 43, 50-52, 62], 3D makeup [16, 24, 30, 39, 56-58], and face verification [15, 40]. ...

AvatarStudio: Text-Driven Editing of 3D Dynamic Human Head Avatars
  • Citing Article
  • December 2023

ACM Transactions on Graphics

... Light estimation is a distinct research field [13,16,42,57,58,60,62,72]. Some methods represent lighting implicitly [24,69,78], limiting generalizability, while others require multi-view inputs and scene mesh data [42,60,72]. DPI [35] is related to our approach. This method combines differential rendering with Stable Diffusion for high-quality envmap generation but relies on multi-view NeRF methods for mesh reconstruction and lacks material BRDF data, reducing accuracy. ...

Diffusion Posterior Illumination for Ambiguity-Aware Inverse Rendering
  • Citing Article
  • December 2023

ACM Transactions on Graphics

... Feed-forward approaches for 3D reconstruction and rendering aim to generalize across scenes by learning from large datasets. Early works on generalizable NeRFs focus on object-level (Chibane et al., 2021; Johari et al., 2022; Reizenstein et al., 2021; Yu et al., 2021) and scene-level reconstruction (Suhail et al., 2022; Wang et al., 2021; Du et al., 2023). These methods typically rely on epipolar sampling or cost volumes to fuse multi-view features, requiring extensive point sampling for rendering, which results in slow speed and often unsatisfactory details. ...

Learning to Render Novel Views from Wide-Baseline Stereo Pairs
  • Citing Conference Paper
  • June 2023

... While text-guided image editing demonstrates promising potential [5,7,17,22,32,35], it is constrained by the ambiguity of language instructions and the lack of precise spatial control, e.g., failing to accurately adjust the shape, position, or posture of a human. In contrast, interactive image editing [1,11,25,41,51] offers a more flexible and precise solution, which supports more intuitive operations like drawing sketches, clicking points, and dragging regions. ...

Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold
  • Citing Conference Paper
  • July 2023

... However, these methods do not natively allow for relighting, which is required for accurately compositing the face into different backgrounds or environments. Some extensions [Deng et al. 2023; Jiang et al. 2023; Pan et al. 2021, 2022; Ranjan et al. 2023] aim to disentangle the geometry and reflectance of the face from the environmental lighting by implicitly learning a subspace of intrinsic components like albedo, specularity, and normals, but do not model the light transport accurately enough with ground truth disentangled data. In-the-wild images have a low dynamic range, non-linear photometric effects due to saturation and colored lighting, and different camera response curves. ...

GAN2X: Non-Lambertian Inverse Rendering of Image GANs
  • Citing Conference Paper
  • September 2022

... Compared with neural SDF, these methods result in discontinuous artifacts during the reconstruction. NeRFactor [53], NeRD [3], Neural-PIL [4], NeRV [40], Neural Transfer Field [27], InvRender [54] and TensoIR [15] use a density field as the geometry representation and an environment tensor with Monte Carlo sampling for the light reconstruction. To solve the ambiguity of the base color and environment light, [8,19] show the importance of adding a material prior to inverse rendering. ...

Neural Radiance Transfer Fields for Relightable Novel-View Synthesis with Global Illumination
  • Citing Chapter
  • October 2022

Lecture Notes in Computer Science

... Equipped with the neural scene representation and a differentiable rendering algorithm, 3D-aware GANs can produce multi-view consistent images [10,56,67]. Several approaches [9, 26, 51, 75, 80] adopt a two-stage rendering process, which leverages convolutional neural networks (CNNs) to increase the resolution of the image or neural rendering features, to generate 3D-aware images at higher resolution efficiently. ...

Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images
  • Citing Conference Paper
  • June 2022

... Given the bounding box of the entire scene, our approach first partitions the scene into several sub-regions, iterates through each sub-region, and uses location-specific sampling within the sub-region. To achieve this, we follow the concept of the Neural Ground Plan (NGP) [48] that assumes the scene can be represented as a flat surface and partition it into a two-dimensional grid from a top-down perspective. Instead of optimizing the whole scene, ours optimizes only a certain sub-region at a time with the location-specific sampling technique, which largely reduces the sampling range and thus improves the training efficiency. ...

Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement
  • Citing Preprint
  • July 2022

... Implicit neural representation models 3D scenes as differentiable continuous neural networks (Tewari et al., 2022). NeRF (Mildenhall et al., 2020) learns density and radiance field values of the scene supervised by 2D images. ...

Advances in Neural Rendering