November 2024 · 7 Reads
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild. Traditional frameworks, such as ParticleSfM [Zhao et al. 2022], address this problem by sequentially computing the optical flow between adjacent frames to obtain point trajectories. They then remove dynamic trajectories through motion segmentation and perform global bundle adjustment. However, estimating optical flow between adjacent frames and chaining the matches can introduce cumulative errors. Additionally, motion segmentation combined with single-view depth estimation often suffers from scale ambiguity. To tackle these challenges, we propose a dynamic-aware tracking any point (DATAP) method that leverages consistent video depth and point tracking. Specifically, DATAP estimates dense point tracking across the video sequence and predicts the visibility and dynamics of each point. By incorporating the consistent video depth prior, the performance of motion segmentation is enhanced. With DATAP integrated, all camera poses can be estimated and optimized simultaneously by performing global bundle adjustment on point tracks classified as static and visible, rather than relying on incremental camera registration. Extensive experiments on dynamic sequences, e.g., Sintel and TUM RGBD dynamic sequences, and on in-the-wild videos, e.g., DAVIS, demonstrate that the proposed method achieves state-of-the-art camera pose estimation even in complex and challenging dynamic scenes.
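As a rough illustration of the last step, the sketch below (not the authors' code; array shapes and threshold names are assumptions) filters dense point tracks down to the static, visible observations that a global bundle adjustment would consume.

```python
# Minimal sketch: keep only static, visible track observations for global BA.
import numpy as np

def select_ba_observations(tracks, visibility, dynamic_prob,
                           vis_thresh=0.5, dyn_thresh=0.5):
    """tracks: (T, N, 2) pixel positions of N tracked points over T frames.
    visibility, dynamic_prob: (T, N) per-frame predictions in [0, 1].
    Returns a boolean mask (T, N) of observations usable for global BA."""
    visible = visibility > vis_thresh       # point is seen in this frame
    static = dynamic_prob < dyn_thresh      # point belongs to the static scene
    mask = visible & static
    # Tracks observed in fewer than 2 frames give no multi-view constraint.
    long_enough = mask.sum(axis=0) >= 2
    return mask & long_enough[None, :]

# Usage with random stand-in predictions:
T, N = 48, 1024
mask = select_ba_observations(np.random.rand(T, N, 2),
                              np.random.rand(T, N),
                              np.random.rand(T, N))
print("observations kept:", int(mask.sum()))
```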
November 2024 · 5 Reads · ACM Transactions on Graphics
High-quality real-time rendering using user-affordable capture rigs is an essential property of human performance capture systems for real-world applications. However, state-of-the-art performance capture methods may not yield satisfactory rendering results under a very sparse (e.g., four-camera) capture setting. Specifically, neural radiance field (NeRF)-based and 3D Gaussian Splatting (3DGS)-based methods tend to produce local geometry errors for unseen performers, while occupancy field (PIFu)-based methods often produce unrealistic rendering results. In this paper, we propose a novel generalizable neural approach to reconstruct and render performers in high quality from very sparse RGBD streams. The core of our method is a novel point-based generalizable human (PGH) representation conditioned on pixel-aligned RGBD features. The PGH representation learns a surface implicit function for the regression of surface points and a Gaussian implicit function that parameterizes the radiance fields of the regressed surface points with 2D Gaussian surfels, and uses surfel splatting for fast rendering. We learn this hybrid human representation via two novel networks. First, we propose a point-regressing network (PRNet) with a depth-guided point cloud initialization (DPI) method to regress an accurate surface point cloud from the denoised depth information. Second, we propose a neural blending-based surfel splatting network (SPNet) that renders high-quality geometry and appearance in novel views from the regressed surface points and high-resolution RGBD features of adjacent views. Our method produces free-view human performance videos at 1K resolution and 12 fps on average. Experiments on two benchmarks show that our method outperforms state-of-the-art human performance capture methods.
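The depth-guided initialization can be pictured as plain RGBD back-projection; the following sketch (an assumption about the general recipe, not the paper's implementation) lifts each valid depth pixel into world space with the pinhole model.

```python
# Minimal sketch of a depth-guided point cloud initialization.
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """depth: (H, W) metric depth, K: (3, 3) intrinsics,
    cam_to_world: (4, 4) camera pose. Returns (M, 3) world-space points."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (M, 4)
    return (cam_to_world @ pts_cam.T).T[:, :3]

# Usage: points from all sparse views would be merged and handed to the
# surface-point regression network as its initialization.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pts = backproject_depth(np.full((480, 640), 2.0), K, np.eye(4))
print(pts.shape)  # (307200, 3)
```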
November 2024 · 1 Read · ACM Transactions on Graphics
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, such as feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1–2 s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that dynamic scenes generally exhibit varying degrees of temporal redundancy, as they consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during training and rendering regardless of video length. Moreover, we design a Compact Appearance Model that mixes diffuse and view-dependent Gaussians to further minimize the model size while maintaining rendering quality. We also develop a hardware-accelerated rasterization pipeline for Gaussian primitives to improve rendering speed. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this is the first approach capable of efficiently handling hours of volumetric video data while maintaining state-of-the-art rendering quality.
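To make the constant-memory claim concrete, here is a minimal sketch, purely an assumption about how such a temporal hierarchy could be organized (not the released code): each level covers the timeline with segments of a fixed length, and rendering a timestamp touches only one segment per level.

```python
# Minimal sketch of a temporal hierarchy over Gaussian primitives.
import numpy as np

class TemporalHierarchy:
    def __init__(self, segment_lengths):
        # e.g. segment_lengths = [3600.0, 8.0, 1.0]: level 0 holds near-static
        # content, deeper levels hold faster-changing content.
        self.levels = [{"seg_len": s, "segments": {}} for s in segment_lengths]

    def add(self, level, t_start, gaussian_ids):
        seg_len = self.levels[level]["seg_len"]
        key = int(t_start // seg_len)
        self.levels[level]["segments"].setdefault(key, []).extend(gaussian_ids)

    def active(self, t):
        """Gaussian ids needed to render time t: one segment per level."""
        ids = []
        for lvl in self.levels:
            ids.extend(lvl["segments"].get(int(t // lvl["seg_len"]), []))
        return np.array(ids, dtype=np.int64)

h = TemporalHierarchy(segment_lengths=[3600.0, 8.0, 1.0])
h.add(0, 0.0, [0, 1, 2])   # background shared across the whole video
h.add(2, 12.0, [3, 4])     # fast-changing content near t = 12 s
print(h.active(12.5))      # -> [0 1 2 3 4]
```

Because only one segment per level is resident at any timestamp, the working set stays roughly constant regardless of how long the video is.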
November 2024 · 41 Reads · ACM Transactions on Graphics
Unbiased Monte Carlo path tracing, which is extensively used in realistic rendering, produces undesirable noise, especially at low samples per pixel (spp). Recently, several methods have addressed this problem by feeding unbiased noisy images and auxiliary features into neural networks to either predict a fixed-sized kernel for convolution or directly predict the denoised result. Since it is impossible to include arbitrarily high-spp images in the training dataset, network-based denoising fails to produce high-quality images at high spp. More specifically, network-based denoising is inconsistent and does not converge to the ground truth as the sampling rate increases. On the other hand, post-correction estimators yield a blending coefficient for a pair of biased and unbiased images, driven by estimated image errors or variances, to ensure the consistency of the denoised image. As the sampling rate increases, the blending coefficient of the unbiased image converges to 1, i.e., the unbiased image is used as the denoised result. However, these estimators usually produce artifacts because image errors or variances are difficult to predict accurately at low spp. To address these problems, we take advantage of both kernel-predicting methods and post-correction denoisers. We propose a novel kernel-based denoiser grounded in distribution-free kernel regression consistency theory, which does not explicitly combine the biased and unbiased results but instead constrains the kernel bandwidth to produce consistent results at high spp. Meanwhile, our kernel regression method optimizes the bandwidth in a robust auxiliary feature space instead of the noisy image space. This leads to consistent high-quality denoising at both low and high spp. Experimental results demonstrate that our method outperforms existing denoisers in accuracy and consistency.
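For intuition, the sketch below shows a generic kernel regression in an auxiliary-feature space (a stand-in, not the paper's estimator): weights are driven by feature distance and a bandwidth h, the quantity the paper constrains to obtain consistency at high spp; as h shrinks, the output approaches the unbiased input.

```python
# Minimal sketch of feature-space kernel regression denoising.
import numpy as np

def feature_kernel_denoise(noisy, features, radius=3, h=0.1):
    """noisy: (H, W, 3) radiance, features: (H, W, F) auxiliary features
    (e.g. albedo and normals), h: kernel bandwidth."""
    H, W, _ = noisy.shape
    acc = np.zeros_like(noisy)
    wsum = np.zeros((H, W, 1))
    pad_img = np.pad(noisy, ((radius, radius), (radius, radius), (0, 0)), mode="reflect")
    pad_feat = np.pad(features, ((radius, radius), (radius, radius), (0, 0)), mode="reflect")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            img = pad_img[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            feat = pad_feat[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            # Gaussian weight in feature space, not image space.
            d2 = np.sum((feat - features) ** 2, axis=-1, keepdims=True)
            w = np.exp(-d2 / (2.0 * h * h))
            acc += w * img
            wsum += w
    return acc / wsum

noisy = np.random.rand(64, 64, 3)
feats = np.random.rand(64, 64, 6)   # e.g. albedo + shading normals
print(feature_kernel_denoise(noisy, feats).shape)  # (64, 64, 3)
```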
November 2024 · 8 Reads
In this paper, we propose a novel multi-view stereo (MVS) framework that does not rely on a depth range prior. Unlike recent prior-free MVS methods that work in a pair-wise manner, our method considers all source images simultaneously. Specifically, we introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information within and across multi-view images. Considering the asymmetry of the epipolar disparity flow, the key to our method lies in accurately modeling multi-view geometric constraints. We integrate pose embeddings that encapsulate information such as multi-view camera poses, providing implicit geometric constraints for the attention-dominated fusion of multi-view disparity features. Additionally, because the observation quality of the same reference-frame pixel differs significantly across source frames, we construct a corresponding hidden state for each source image. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of each source image and dynamically update the hidden states through an uncertainty estimation module. Extensive results on the DTU dataset and the Tanks & Temples benchmark demonstrate the effectiveness of our method. The code is available at our project page: https://zju3dv.github.io/GD-PoseMVS/.
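As an illustration of attention-based multi-view fusion with pose embeddings, here is a minimal PyTorch sketch (an illustrative stand-in, not the released model; shapes and the pose encoding are assumptions): a relative-pose embedding is added to the source-view keys so the attention is aware of multi-view geometry.

```python
# Minimal sketch of pose-embedded multi-view attention fusion.
import torch
import torch.nn as nn

class MultiViewDisparityAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(16, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ref_feat, src_feats, rel_poses):
        """ref_feat: (B, P, C) reference-pixel features,
        src_feats: (B, S, P, C) features sampled from S source views,
        rel_poses: (B, S, 4, 4) reference-to-source camera transforms."""
        B, S, P, C = src_feats.shape
        pose_emb = self.pose_mlp(rel_poses.reshape(B, S, 16))   # (B, S, C)
        keys = src_feats + pose_emb[:, :, None, :]              # inject geometry
        q = ref_feat.reshape(B * P, 1, C)                       # attend per pixel
        kv = keys.permute(0, 2, 1, 3).reshape(B * P, S, C)
        fused, _ = self.attn(q, kv, kv)
        return fused.reshape(B, P, C)

m = MultiViewDisparityAttention()
out = m(torch.randn(2, 100, 64), torch.randn(2, 3, 100, 64), torch.randn(2, 3, 4, 4))
print(out.shape)  # torch.Size([2, 100, 64])
```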
... By leveraging 3DGS, dense visual SLAM can achieve improved rendering performance and trajectory accuracy, both essential for immersive XR experiences that depend on accurate spatial understanding [74], [76], [77]. Moreover, incorporating semantic and uncertainty-aware processing into 3DGS enhances robustness in dynamic environments, which can be further used in advanced XR applications like augmented reality navigation and interactive scanning [73], [75]. In medical XR, 3DGS leads to high-quality and real-time 3D reconstructions and visualizations of anatomical structures, which are crucial for medical diagnostics, surgical planning, and immersive training simulations. ...
October 2024
... To enhance multi-view consistency, methods such as 2DGS [32] compress the 3D volume into planar Gaussian disks, while GOF [33] uses a Gaussian opacity field to extract geometry through level sets directly. PGSR [34] transforms Gaussian shapes into planar forms, optimizing them for realistic surface representation and simplifying parameter calculations, such as normals and distance measures. Nonetheless, achieving consistent geometry across multiple views remains a challenge in 3DGS approaches. ...
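For context, the planar-Gaussian idea can be illustrated with a small sketch (an assumption about the general recipe, not PGSR's code): the normal of a flattened Gaussian is the covariance eigenvector with the smallest eigenvalue, and its projection onto the mean gives a signed plane distance.

```python
# Minimal sketch: read a plane normal and distance off a flattened 3D Gaussian.
import numpy as np

def gaussian_plane(mean, cov):
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    normal = eigvecs[:, 0]                   # thinnest axis of the Gaussian
    distance = float(normal @ mean)          # signed plane-to-origin distance
    return normal, distance

n, d = gaussian_plane(np.array([0.0, 0.0, 2.0]),
                      np.diag([0.5, 0.5, 1e-4]))
print(n, d)   # normal along z (up to sign), distance 2.0
```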
January 2024 · IEEE Transactions on Visualization and Computer Graphics
... Pose-Dependent Human Avatars. More recently, a number of 3DGS-based approaches have been proposed for creating animatable avatars from monocular [2,3,9] and multiview [6,12,19,24,32] video. GaussianAvatar [2] uses pixel-aligned features for optimizing 3D Gaussians from monocular video. ...
June 2024
... In the realm of human-scene interaction, kinematics-based humanoid controllers have proven effective in executing interactive motions such as sitting, opening doors, and carrying objects [Starke et al. 2019]. By incorporating generative frameworks such as conditional Variational Autoencoders (cVAE) [Hassan et al. 2021] or diffusion models [Cen et al. 2024; Pi et al. 2023], or by further learning a Reinforcement Learning (RL) policy to control the VAE's learned latent space [Luo et al. 2023; Zhao et al. 2023a], a more diverse range of interaction behaviors can be synthesized. In physics simulation applications, diverse interaction motions such as catching and carrying, coupled with locomotion skills, have been realized by distilling a variety of expert demonstrations and RL-based control with visual input [Merel et al. 2020]. ...
June 2024
... Although they are well-suited for pose-driven animation, they often struggle with per-frame non-rigid warping in scenarios with complex garments. Other approaches [33,51] utilize temporal embeddings instead of SMPL poses to model each frame independently, achieving high-quality rendering but making it challenging to animate avatars. ...
June 2024
... Recently, deep visual SLAM and SfM systems have emerged that adopt deep neural networks to estimate pairwise or long-term correspondences [2,7,18,19,21,55,58,60,61,64,66,74], to reconstruct radiance fields [11,34,42] or global 3D point clouds [28,67]. While these methods demonstrate accurate camera tracking and reconstruction, they typically assume predominantly static scenes and sufficient camera baselines between frames. ...
June 2024
... Moreover, due to the discrete nature of the 3D Gaussian representation, a 3DGS-represented head cannot be directly edited in the UV texture space the way polygon mesh models can. Previous editable methods [3,22,50] rely on extensive optimization with pre-trained diffusion models, such as Instruct-Pix2Pix [5], which is both time-consuming and uncontrollable. Although some prior methods [1,36,49,56,65] also structure Gaussian points into the UV space, our experiments reveal that their reconstructed textures are discontinuous in the UV domain. ...
June 2024
... Text2Room [14] proposes a warping and inpainting methodology for mesh population and scene creation, while Text2NeRF [59] shifts away from mesh-based reconstruction and uses radiance fields as scene generation priors. Although these methods are initially constrained to camera-centric scenes, subsequent work [60] expands capabilities to support general 3D scene generation with arbitrary six-degree-of-freedom (DOF) camera trajectories. However, these approaches remain limited to static scenes, lacking the ability to incorporate motion, which is a crucial element for representing dynamic, real-world environments. ...
June 2024
... Local feature matching [1,2] is a fundamental problem in the field of computer vision and plays a significant role in downstream applications, including but not limited to SLAM [3][4][5][6][7][8], 3D reconstruction [9,10], visual localization [11][12][13], and object pose estimation [14,15]. However, traditional CNN-based methods [16,17] often fail under extreme conditions, such as dramatic changes in scale, illumination, or viewpoint, or weakly-textured scenes, due to their lack of a global receptive field. ...
September 2024 · Communications Engineering
... In contrast, PanopticFusion [35] combines predicted instances and class labels (including background) to generate pixel-wise panoptic predictions, which are then integrated into a 3D mesh. More recent works, such as those by Menini et al. [32] and ALSTER [55], jointly reconstruct geometry and semantics in a SLAM framework. Additionally, NIS-SLAM [61] trains a multi-resolution tetrahedron NeRF to encode color, depth, and semantics. NEDS-SLAM [21] is a 3DGS-based SLAM system with embedded semantic features that learns an additional semantic representation of a closed set of classes. ...
September 2024 · IEEE Transactions on Visualization and Computer Graphics