Roman Shapovalov’s research while affiliated with Meta and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (20)


Figure 2. Overview of Twinner. The input views are processed together with their foreground masks and camera poses by an image tokenizer. The resulting tokens are processed by a Diffusion Transformer (DiT) model to predict a 3D volumetric representation of the scene, and by a second DiT model to predict the scene's illumination. The 3D representation is then rendered to obtain images of the scene's material properties, normals, and opacity from the target viewpoint. Using these rendered images together with the predicted illumination, Twinner renders an approximate physically-shaded image of the scene, which is then supervised with a photometric loss.
Figure 4. Examples from our procedural dataset visualizing the shaded image, environment map, materials, and normals from our environment-map-endowed version of Zeroverse [57].
Twinner: Shining Light on Digital Twins in a Few Snaps
  • Preprint
  • File available

March 2025 · 5 Reads

Jesus Zarzar · Tom Monnier · Roman Shapovalov · [...]

We present the first large reconstruction model, Twinner, capable of recovering a scene's illumination as well as an object's geometry and material properties from only a few posed images. Twinner is based on the Large Reconstruction Model and innovates in three key ways: 1) We introduce a memory-efficient voxel-grid transformer whose memory scales only quadratically with the size of the voxel grid. 2) To deal with the scarcity of high-quality ground-truth PBR-shaded models, we introduce a large, fully synthetic dataset of procedurally generated PBR-textured objects lit with varied illumination. 3) To narrow the synthetic-to-real gap, we finetune the model on real-life datasets by means of a differentiable physically-based shading model, eschewing the need for ground-truth illumination or material properties, which are challenging to obtain in real life. We demonstrate the efficacy of our model on the real-life StanfordORB benchmark where, given few input views, we achieve reconstruction quality significantly superior to existing feedforward reconstruction networks, and comparable to significantly slower per-scene optimization methods.
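The differentiable shading supervision described in point 3 can be illustrated with a minimal sketch: shade the rendered material, normal, and opacity maps with the predicted illumination and penalize the difference to the observed image. The tensor layout, the simple Lambertian shading, and the function name below are illustrative assumptions, not Twinner's actual implementation.

```python
import torch
import torch.nn.functional as F

def photometric_shading_loss(albedo, normals, opacity, env_dirs, env_radiance, target_rgb):
    """Hypothetical sketch: shade rendered material maps with a predicted
    environment light and compare to the observed image.

    albedo:       (B, 3, H, W) rendered base color
    normals:      (B, 3, H, W) rendered world-space normals
    opacity:      (B, 1, H, W) rendered alpha
    env_dirs:     (L, 3) unit directions of environment-light samples
    env_radiance: (B, L, 3) predicted radiance per sampled direction
    target_rgb:   (B, 3, H, W) observed image
    """
    # Lambertian shading: average the clamped cosine over the sampled light
    # directions (normalization constants omitted in this sketch).
    n = F.normalize(normals, dim=1)
    cos = torch.einsum('bchw,lc->blhw', n, env_dirs).clamp(min=0.0)
    irradiance = torch.einsum('blhw,blc->bchw', cos, env_radiance) / env_dirs.shape[0]
    shaded = albedo * irradiance * opacity            # premultiply by opacity to ignore background
    return F.l1_loss(shaded, target_rgb * opacity)    # compare against the alpha-masked observation
```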


Figure 2. Statistics of uCO3D. (Left) We plot the number of objects per super-category. In total, the dataset contains 50 super-categories, each gathering around 20 sub-categories. (Right) We show a word cloud of all 1,070 visual categories represented in the dataset.
Figure 5. 3D reconstruction comparison. We show results of LightplaneLRM [6] models trained on MVImgNet, CO3Dv2 and uCO3D.
Overview of 3D object datasets. We compare the number of objects / classes, the type of data and associated annotations.
UnCommon Objects in 3D

January 2025 · 38 Reads

We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly available collection of high-resolution videos of objects with 3D annotations that ensures full 360° coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.
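As a rough illustration of the per-object annotations listed above (camera poses, depth maps, a sparse point cloud, a caption, and a 3D Gaussian Splat reconstruction), here is a hypothetical container type; the class name, field names, and shapes are assumptions for illustration only, not the dataset's actual API or schema.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class UCO3DObjectAnnotations:
    """Hypothetical per-object record mirroring the annotation types
    described in the abstract; not the official uCO3D schema."""
    category: str                    # one of the 1,000+ object categories
    caption: str                     # natural-language description of the object
    camera_poses: np.ndarray         # (num_frames, 4, 4) camera matrices (convention assumed)
    depth_maps: List[np.ndarray]     # one (H, W) depth map per frame
    sparse_point_cloud: np.ndarray   # (num_points, 3) reconstructed points
    gaussian_splat_path: str         # path to the per-object 3D Gaussian Splat reconstruction
```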


PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models

December 2024 · 13 Reads

Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and feeds the completed views to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.
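A high-level sketch of the multi-stage data flow described above, with hypothetical function names standing in for the part-segmentation diffusion model, the per-part completion model, and the reconstruction network; this illustrates the control flow only and is not the authors' code.

```python
# Hypothetical data flow for a PartGen-style pipeline (all names are placeholders).

def partgen_pipeline(object_views, segment_parts, complete_part, reconstruct_3d):
    """object_views: list of multi-view images of one object.
    segment_parts(views) -> list of per-part, view-consistent masks.
    complete_part(views, mask, context_views) -> occlusion-completed views of one part.
    reconstruct_3d(views) -> a 3D representation of one part.
    """
    part_masks = segment_parts(object_views)            # stage 1: multi-view part segmentation
    parts_3d = []
    for mask in part_masks:
        # stage 2: complete each part in isolation, conditioned on the whole object for context
        completed_views = complete_part(object_views, mask, context_views=object_views)
        parts_3d.append(reconstruct_3d(completed_views)) # stage 3: lift the completed views to 3D
    return parts_3d                                      # structured result: one 3D asset per part
```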



Meta 3D Gen

July 2024 · 20 Reads

We introduce Meta 3D Gen (3DGen), a new state-of-the-art, fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously generated (or artist-created) 3D shapes using additional textual inputs provided by the user. 3DGen integrates key technical components, Meta 3D AssetGen and Meta 3D TextureGen, that we developed for text-to-3D and text-to-texture generation, respectively. By combining their strengths, 3DGen represents 3D objects simultaneously in three ways: in view space, in volumetric space, and in UV (or texture) space. The integration of these two techniques achieves a win rate of 68% with respect to the single-stage model. We compare 3DGen to numerous industry baselines, and show that it outperforms them in terms of prompt fidelity and visual quality for complex textual prompts, while being significantly faster.
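A schematic of the two-stage composition described above (AssetGen producing the initial textured asset, TextureGen refining or regenerating its textures), with placeholder function names; it illustrates the control flow only and is not the actual 3DGen interface.

```python
# Placeholder orchestration of a two-stage text-to-3D pipeline (illustrative only).

def text_to_3d(prompt, assetgen, texturegen, retexture_prompt=None):
    """assetgen(prompt) -> mesh with initial PBR materials (stage I).
    texturegen(mesh, prompt) -> mesh with refined or regenerated textures (stage II)."""
    mesh = assetgen(prompt)                    # stage I: text -> 3D shape + PBR materials
    # stage II: refine the textures, or retexture an existing asset from a new prompt
    texture_prompt = retexture_prompt or prompt
    return texturegen(mesh, texture_prompt)
```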


Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials

July 2024 · 11 Reads

We present Meta 3D AssetGen (AssetGen), a significant advancement in text-to-3D generation which produces faithful, high-quality meshes with texture and material control. Compared to works that bake shading into the 3D object's appearance, AssetGen outputs physically-based rendering (PBR) materials, supporting realistic relighting. AssetGen first generates several views of the object with factored shaded and albedo appearance channels, and then reconstructs colours, metalness and roughness in 3D, using a deferred shading loss for efficient supervision. It also uses a signed-distance function to represent the 3D shape more reliably and introduces a corresponding loss for direct shape supervision. This is implemented using fused kernels for high memory efficiency. After mesh extraction, a texture refinement transformer operating in UV space significantly improves sharpness and details. AssetGen achieves a 17% improvement in Chamfer Distance and 40% in LPIPS over the best concurrent work for few-view reconstruction, and a human preference of 72% over the best industry competitors of comparable speed, including those that support PBR. Project page with generated assets: https://assetgen.github.io
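A toy illustration of the deferred-shading idea mentioned above: rather than supervising material channels directly, shade the reconstructed G-buffers under some lighting and compare the result to a reference image. The single directional light and the simplified diffuse/specular split below are assumptions made for brevity and do not reproduce the paper's shading model.

```python
import torch
import torch.nn.functional as F

def toy_deferred_shading_loss(albedo, metalness, roughness, normals,
                              light_dir, light_rgb, view_dir, target_rgb):
    """Illustrative only. Maps are (B, C, H, W); light_dir / view_dir / light_rgb are (3,) tensors.
    Shades predicted G-buffers with one directional light, then compares to a reference image."""
    n = F.normalize(normals, dim=1)
    l = F.normalize(light_dir, dim=0).view(1, 3, 1, 1)
    v = F.normalize(view_dir, dim=0).view(1, 3, 1, 1)
    h = F.normalize(l + v, dim=1)                                  # half vector
    n_dot_l = (n * l).sum(1, keepdim=True).clamp(min=0.0)
    n_dot_h = (n * h).sum(1, keepdim=True).clamp(min=0.0)
    diffuse = albedo * (1.0 - metalness)                           # metals get no diffuse term
    spec_color = torch.lerp(torch.full_like(albedo, 0.04), albedo, metalness)
    shininess = 2.0 / roughness.clamp(min=1e-3) ** 2               # crude roughness-to-exponent mapping
    specular = spec_color * n_dot_h ** shininess
    shaded = (diffuse + specular) * n_dot_l * light_rgb.view(1, 3, 1, 1)
    return F.l1_loss(shaded, target_rgb)
```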



Replay: Multi-modal Multi-view Acted Videos for Casual Holography

July 2023 · 12 Reads

We introduce Replay, a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed at high production quality from different viewpoints, using several static cameras as well as wearable action cameras, and is recorded with a large array of microphones at different positions in the room. Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames annotated with camera poses and partially with foreground masks. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models. We provide a benchmark for training and evaluating novel-view synthesis, with two scenarios of different difficulty. Finally, we evaluate several baseline state-of-the-art methods on the new benchmark.




Citations (7)


... However, with the development of camera equipment, despite many efforts using monocular handheld cameras or fixed camera arrays to build datasets, none of these solutions fully address all four requirements. For example, Panoptic Sports [19], ZJU-Mocap [44] and DiVa-360 [37] achieved human/object-centric 360° capture but lack complex backgrounds and multimodal data; Google's immersive light field [6] achieves 6-DoF volumetric video in an inside-looking-out manner but lacks multimodality and high frame rate, long-duration dynamic content; Replay [49] introduces sound, but the camera setup is not suitable for human viewing habits thus cannot achieve high-quality 6-DoF interaction. Both academia and industry require an appropriate capture method to support the development of immersive volumetric video technology. ...

Reference:

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
Replay: Multi-modal Multi-view Acted Videos for Casual Holography
  • Citing Conference Paper
  • October 2023

... Additional supervising signals such as depth [10] or 4D mesh information [9] are occasionally provided. Many of the existing dynamic data collections are human [8,9,38,121] or animal-centered [89], mainly focusing on human poses [101]. Such human-related data also exists from an ego perspective [21,100]. ...

Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories
  • Citing Conference Paper
  • June 2023

... Stereo-Energy scales the stereo input audio to match with the target. We compare INRAS [45], NAF [28], ViGAS [7], AV-NeRF [25], AV-GS [5] for the neural network-based methods. Compared to AV-NeRF, our model achieves substantial improvements in MAG and ENV, with a 9.7% improvement in MAG and 3.4% improvement in ENV, demonstrating its effectiveness in utilizing surface-integrated geometry. ...

Novel-View Acoustic Synthesis
  • Citing Conference Paper
  • June 2023

... These methods learn an explicit template shape, the deformation skeleton and skinning weights. Orthogonal to this, NPG [11] and KeyTr [49] represent shapes as basis points and coefficients learned from videos. Despite impressive results, these methods assume a single object and cannot handle compositional shapes during human-object interaction. ...

KeyTr: Keypoint Transporter for 3D Reconstruction of Deformable Objects in Videos
  • Citing Conference Paper
  • June 2022

... In DensePose [29,73], 2D keypoint estimation is generalized to a continuous representation, where arbitrary surface points can be regression targets, instead of having a fixed set of sparse keypoints throughout the dataset. DensePose3D [93] shows how 2D dense maps can be used to lift estimates to 3D. Similar in spirit, DenseBody [113,120] predicts 3D mesh points on a 2D UV-map. ...

DensePose 3D: Lifting Canonical Surface Maps of Articulated Objects to the Third Dimension
  • Citing Conference Paper
  • October 2021

... Datasets. We use 19 publicly available datasets from the training sets of the three teachers as distillation data, leading to around 20.7M images in total: ImageNet-19K [17] (2021 release [68]), Mapillary [64] and Google Landmarks v2 [65] from DINO-v2, AGORA [39], BEDLAM [7], UBody [20,32] and CUFFS [4] from Multi-HMR; Habitat [36], ARKitScenes [16], Blended MVS [69], MegaDepth [31], ScanNet++ [70], CO3D-v2 [43], Mapfree [3], WildRgb [1], VirtualKitti [9], Unreal4K [57], TartanAir [63] and DL3DV [33] from MASt3R. We only use the images of those datasets and discard all annotations. ...

Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction
  • Citing Conference Paper
  • October 2021

... Given a 3D cell volume V ∈ R^(H×W×C), where H, W, and C represent the X, Y, and Z axes, respectively, with H and W corresponding to high-resolution dimensions and C corresponding to a low-resolution dimension, we extract three slices of the volume: the XY, XZ, and YZ plane slices, denoted as I^t_xy, I^t_xz, and I^t_yz. Following the approach outlined in [9,13], we first project the 3D coordinates x^V_hwc of each grid element (h, w, c) from the 3D cell volume onto each corresponding XY plane slice I_xy. A fixed ResNet32 is then applied to extract the 2D features f^t_hwc for each slice, with t indicating the specific XY plane slice. ...

Unsupervised Learning of 3D Object Categories from Videos in the Wild
  • Citing Conference Paper
  • June 2021
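The excerpt above (from a paper citing this work) describes slicing a dense 3D volume into orthogonal planes and running a 2D feature extractor on each. A minimal sketch of that step is given below; the choice of central slices along each axis and the use of torchvision's resnet18 in place of the cited "ResNet32" are assumptions made purely for illustration.

```python
import torch
from torchvision.models import resnet18

def extract_orthogonal_slice_features(volume):
    """volume: (B, 1, H, W, C) 3D cell volume. Extracts 2D feature maps from the
    central XY, XZ, and YZ slices (a simplification of the excerpt's scheme)."""
    b, _, h, w, c = volume.shape
    xy = volume[:, :, :, :, c // 2]          # (B, 1, H, W)
    xz = volume[:, :, :, w // 2, :]          # (B, 1, H, C)
    yz = volume[:, :, h // 2, :, :]          # (B, 1, W, C)
    # A frozen 2D backbone stands in for the fixed "ResNet32" in the excerpt.
    backbone = torch.nn.Sequential(*list(resnet18(weights=None).children())[:-2]).eval()
    feature_maps = []
    with torch.no_grad():
        for plane in (xy, xz, yz):
            feature_maps.append(backbone(plane.repeat(1, 3, 1, 1)))  # ResNet expects 3 channels
    return feature_maps                      # three (B, 512, h', w') spatial feature maps
```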