
Montage4D: Real-time Seamless Fusion and Stylization of Multiview Video Textures

Preprints and early-stage research may not have been peer reviewed yet.


The commoditization of virtual and augmented reality devices and the availability of inexpensive consumer depth cameras have catalyzed a resurgence of interest in spatiotemporal performance capture. Recent systems like Fusion4D and Holoportation address several crucial problems in the real-time fusion of multiview depth maps into volumetric and deformable representations. Nonetheless, stitching multiview video textures onto dynamic meshes remains challenging due to imprecise geometries, occlusion seams, and critical time constraints. In this paper, we present a practical solution for real-time seamless texture montage for dynamic multiview reconstruction. We build on the ideas of dilated depth discontinuities and majority voting from Holoportation to reduce ghosting effects when blending textures. In contrast to that approach, we determine the appropriate blend of textures per vertex using view-dependent rendering techniques, so as to avert fuzziness caused by the ubiquitous normal-weighted blending. By leveraging geodesics-guided diffusion and temporal texture fields, our algorithm mitigates spatial occlusion seams while preserving temporal consistency. Experiments demonstrate significant enhancement in rendering quality, especially in detailed regions such as faces. Furthermore, we present our preliminary exploration of real-time stylization and relighting to empower Holoportation users to interactively stylize live 3D content. We envision a wide range of applications for Montage4D, including immersive telepresence for business, training, and live entertainment.
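The contrast the abstract draws between normal-weighted blending and per-vertex, view-dependent blending with temporal consistency can be illustrated with a small sketch. This is not the paper's formulation: the function names, the cosine-power falloff `alpha`, the boolean visibility test, and the smoothing `rate` are all assumptions chosen for illustration. Each camera's weight at a vertex grows as its direction aligns with the current viewing direction, occluded cameras get zero weight, and the weights move gradually toward their target from frame to frame rather than jumping.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def view_dependent_weights(view_dir, camera_dirs, visible, alpha=2.0):
    """Per-vertex blend weights that favor cameras aligned with the viewer.

    view_dir:    direction from the vertex toward the current viewer
    camera_dirs: direction from the vertex toward each capture camera
    visible:     per-camera flag, False if the vertex is occluded in that view
    alpha:       hypothetical falloff exponent (an assumption, not from the paper)
    """
    v = normalize(view_dir)
    raw = []
    for cam_dir, vis in zip(camera_dirs, visible):
        cos = sum(a * b for a, b in zip(v, normalize(cam_dir)))
        raw.append(max(0.0, cos) ** alpha if vis else 0.0)
    total = sum(raw)
    if total == 0.0:
        # No camera sees this vertex from a usable angle.
        return [0.0] * len(raw)
    return [w / total for w in raw]


def temporal_step(previous, target, rate=0.2):
    """Move last frame's weights a fraction of the way toward the target.

    A simple exponential smoothing step, assumed here as a stand-in for
    a temporally consistent weight update.
    """
    return [p + rate * (t - p) for p, t in zip(previous, target)]


if __name__ == "__main__":
    cams = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
    target = view_dependent_weights((0.0, 0.0, 1.0), cams, [True, True, False])
    print(target)
    print(temporal_step([0.5, 0.5, 0.0], target))
```

Because both the previous and target weights sum to one, the smoothed weights do as well, so no renormalization is needed after each step in this sketch.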
... However, these approaches often produce ghosting artifacts due to geometry reconstruction errors, especially near occlusion boundaries. Several approaches seek to reduce these artifacts using optical flow correction [Casas et al. 2015; Du et al. 2019; Eisemann et al. 2008], per-view refinements that align geometric and image boundaries [Chaurasia et al. 2013; Hedman et al. 2016, 2018; Xu et al. 2021], or soft scene forms [Penner and Zhang 2017]. To further reduce ghosting and aliasing artifacts, DeepBlending [Hedman et al. 2018] proposed to train a CNN to predict adaptive per-pixel blending weights, and Xu et al. [2021] employed a post-processing network to perform temporal super-sampling. ...
We propose a scalable neural scene reconstruction and rendering method to support distributed training and interactive rendering of large indoor scenes. Our representation is based on tiles. Tile appearances are trained in parallel through a background sampling strategy that augments each tile with distant scene information via a proxy global mesh. Each tile has two low-capacity MLPs: one for view-independent appearance (diffuse color and shading) and one for view-dependent appearance (specular highlights, reflections). We leverage the phenomenon that complex view-dependent scene reflections can be attributed to virtual lights underneath surfaces at the total ray distance to the source. This lets us handle sparse samplings of the input scene where reflection highlights do not always appear consistently in input images. We show interactive free-viewpoint rendering results from five scenes, one of which covers an area of more than 100 m². Experimental results show that our method produces higher-quality renderings than a single large-capacity MLP and five recent neural proxy-geometry and voxel-based baseline methods. Our code and data are available at the project webpage.