Article

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Authors: Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng

Abstract

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully connected (nonconvolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
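As a concrete, hedged illustration of the scene function described above, the sketch below shows a PyTorch-style MLP that maps a positionally encoded location and viewing direction to a volume density and a view-dependent RGB color; the layer widths and class name are illustrative assumptions of this sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Illustrative NeRF-style MLP: (encoded position, encoded direction) -> (density, RGB).

    Layer widths are assumptions of this sketch, not the published configuration.
    """
    def __init__(self, pos_dim: int = 60, dir_dim: int = 24, width: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)            # volume density, view-independent
        self.rgb_head = nn.Sequential(                   # emitted radiance, view-dependent
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc: torch.Tensor, dir_enc: torch.Tensor):
        h = self.trunk(pos_enc)
        sigma = torch.relu(self.sigma_head(h))           # keep density non-negative
        rgb = self.rgb_head(torch.cat([h, dir_enc], dim=-1))
        return sigma, rgb
```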


... Recent works in large scene reconstruction [22,28,32,47,49,51] mostly take radiance fields as the basic 3D representation, e.g., Neural Radiance Fields (NeRF) [33] and 3D Gaussian Splatting (3DGS) [21]. NeRF-based approaches struggle to scale for large scenes with rich details, as their implicit representations demand substantial resources for both training and rendering. ...
... Traditional methods use structure-from-motion (SfM) algorithms to generate sparse point clouds of the scene or further extract dense point clouds and meshes via multiview stereo methods [1,14,15,17,25,40,44,46,56]. Recently, Neural Radiance Fields (NeRF) [33,47,51,55] and 3D Gaussian Splatting (3DGS) [6,21,28,29,31,32,50,54] have been widely applied to large-scale scene reconstruction, as they outperform point clouds and meshes for novel view synthesis. A divide-and-conquer strategy is commonly applied for both NeRF and 3DGS methods. ...
... The feature matrix X = [x_1, x_2, ..., x_n]^⊤ ∈ R^{n×d} contains the 3D coordinates of each camera with positional encoding [33]. Intuitively, occluded or distant cameras typically share minimal overlapped views, which can be distinguished in our occlusion-aware view graph. ...
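The positional encoding referenced here follows the NeRF recipe of lifting each input coordinate to sinusoids at geometrically increasing frequencies; a minimal sketch is below (the frequency count is a configurable assumption; the NeRF paper uses 10 frequencies for positions and 4 for directions).

```python
import torch

def positional_encoding(p: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Map coordinates p of shape (..., D) to [sin(2^k * pi * p), cos(2^k * pi * p)], k = 0..num_freqs-1."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=p.dtype, device=p.device) * torch.pi
    angles = p[..., None] * freqs                          # (..., D, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., D, 2 * num_freqs)
    return enc.flatten(start_dim=-2)                       # (..., D * 2 * num_freqs)
```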
Preprint
Full-text available
In large-scale scene reconstruction using 3D Gaussian splatting, it is common to partition the scene into multiple smaller regions and reconstruct them individually. However, existing division methods are occlusion-agnostic, meaning that each region may contain areas with severe occlusions. As a result, the cameras within those regions are less correlated, leading to a low average contribution to the overall reconstruction. In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. Cameras in such regions exhibit stronger correlations and a higher average contribution, facilitating high-quality scene reconstruction. We further propose a region-based rendering technique to accelerate large scene rendering, which culls Gaussians invisible to the region where the viewpoint is located. Such a technique significantly speeds up the rendering without compromising quality. Extensive experiments on multiple large scenes show that our method achieves superior reconstruction results with faster rendering speed compared to existing state-of-the-art approaches. Project page: https://occlugaussian.github.io.
... In recent years, there has been a surge in the use of radiance field approaches, such as Neural Radiance Fields (NeRF) (Mildenhall et al. 2020), for view synthesis. To extend the practical applications of radiance fields, various enhancements have been introduced, including methods aimed at increasing processing efficiency (Chen et al. 2021;Neff et al. 2021;Yu et al. 2021;Kurz et al. 2022) and enabling image manipulation capabilities (Lin et al. 2021;Zhang et al. 2021a;Wang et al. 2023;Kuang et al. 2023). ...
... Neural Radiance Fields. The original NeRF, introduced by Mildenhall et al. (Mildenhall et al. 2020), represents a scene as a continuous 5D function that maps spatial coordinates and viewing directions to radiance values. Since its introduction, NeRF-related techniques have found applications in various computer vision tasks (Zhang et al. 2021b;Chen et al. 2022;Azinović et al. 2022;Liu et al. 2024;Chen et al. 2023). ...
... We conducted experiments following the implementation and settings of our baseline methods, NeRF (Mildenhall et al. 2020) and 2DGS. The experiments were conducted on NVIDIA 3090 GPUs, and the Adam optimizer (Kingma and Ba 2015) is employed to optimize the radiance field. ...
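For readers unfamiliar with that training setup, it typically amounts to minimizing a photometric mean-squared error between rendered and ground-truth ray colors with Adam; the sketch below uses a stand-in network and stand-in renderer purely for illustration, not the baselines' actual code.

```python
import torch

# Minimal sketch of radiance-field optimization: Adam minimizing an MSE photometric loss.
# The model and `render_rays` below are stand-ins; a real pipeline composites many
# samples per ray with differentiable volume rendering.
model = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def render_rays(net, rays_o, rays_d):
    # Stand-in renderer: predict a color directly from (origin, direction).
    return torch.sigmoid(net(torch.cat([rays_o, rays_d], dim=-1)))

rays_o = torch.rand(1024, 3)       # ray origins for a sampled batch of pixels
rays_d = torch.rand(1024, 3)       # ray directions
target_rgb = torch.rand(1024, 3)   # ground-truth pixel colors for those rays

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(render_rays(model, rays_o, rays_d), target_rgb)
loss.backward()                    # gradients flow through the (differentiable) renderer
optimizer.step()
```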
Preprint
Full-text available
Recent methods, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated remarkable capabilities in novel view synthesis. However, despite their success in producing high-quality images for viewpoints similar to those seen during training, they struggle when generating detailed images from viewpoints that significantly deviate from the training set, particularly in close-up views. The primary challenge stems from the lack of specific training data for close-up views, leading to the inability of current methods to render these views accurately. To address this issue, we introduce a novel pseudo-label-based learning strategy. This approach leverages pseudo-labels derived from existing training data to provide targeted supervision across a wide range of close-up viewpoints. Recognizing the absence of benchmarks for this specific challenge, we also present a new dataset designed to assess the effectiveness of both current and future methods in this area. Our extensive experiments demonstrate the efficacy of our approach.
... Novel view synthesis for dynamic scenes allows for the creation of realistic representations of 4D environments, which is essential in fields like computer vision, virtual reality, and augmented reality. Traditionally, this area has been led by neural radiance fields (NeRF) [2,12,18,21,25], which model opacity and color over time to depict dynamic scenes. While effective, these NeRF-based methods come with high training and rendering costs, limiting their practicality, especially in real-time applications and on devices with limited resources. ...
... Recently, neural radiance fields (NeRF) [25] have achieved encouraging results in novel view synthesis. NeRF [25] represents the scene by mapping 3D coordinates and view dependency to color and opacity. Since NeRF [25] requires sampling each ray by querying the MLP for hundreds of points, this significantly limits the training and rendering speed. ...
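The cost noted here comes from turning every ray into many 3D sample points and evaluating the MLP at all of them; the sketch below shows NeRF-style stratified sampling along a batch of rays (the ray count, sample count, and near/far bounds are illustrative values, not any paper's settings).

```python
import torch

# Sketch of per-ray sampling cost: each ray becomes n_samples query points for the MLP.
n_rays, n_samples = 4096, 128
near, far = 2.0, 6.0

rays_o = torch.rand(n_rays, 3)                                          # ray origins
rays_d = torch.nn.functional.normalize(torch.rand(n_rays, 3), dim=-1)   # unit ray directions

# Stratified sampling: one random depth inside each of n_samples evenly spaced bins.
bins = torch.linspace(near, far, n_samples + 1)
lower, upper = bins[:-1], bins[1:]
t = lower + (upper - lower) * torch.rand(n_rays, n_samples)             # (n_rays, n_samples)

points = rays_o[:, None, :] + t[..., None] * rays_d[:, None, :]         # (n_rays, n_samples, 3)
print(points.shape, "->", n_rays * n_samples, "MLP queries for one batch of rays")
```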
Preprint
Full-text available
4D Gaussian Splatting (4DGS) has recently gained considerable attention as a method for reconstructing dynamic scenes. Despite achieving superior quality, 4DGS typically requires substantial storage and suffers from slow rendering speed. In this work, we delve into these issues and identify two key sources of temporal redundancy. (Q1) Short-Lifespan Gaussians: 4DGS uses a large portion of Gaussians with short temporal span to represent scene dynamics, leading to an excessive number of Gaussians. (Q2) Inactive Gaussians: When rendering, only a small subset of Gaussians contributes to each frame. Despite this, all Gaussians are processed during rasterization, resulting in redundant computation overhead. To address these redundancies, we present 4DGS-1K, which runs at over 1000 FPS on modern GPUs. For Q1, we introduce the Spatial-Temporal Variation Score, a new pruning criterion that effectively removes short-lifespan Gaussians while encouraging 4DGS to capture scene dynamics using Gaussians with longer temporal spans. For Q2, we store a mask for active Gaussians across consecutive frames, significantly reducing redundant computations in rendering. Compared to vanilla 4DGS, our method achieves a 41× reduction in storage and 9× faster rasterization speed on complex dynamic scenes, while maintaining comparable visual quality. Please see our project page at https://4DGS-1K.github.io.
... In 2020, Mildenhall et al. [23] introduced Neural Radiance Fields (NeRF) for novel view synthesis. NeRFs combine the function approximation capability of multilayer perceptrons (MLPs) with principles of volume rendering. ...
... NeRFs learn an unrestricted volumetric 3D representation and the view dependent appearance of an object or a scene from 2D images. Furthermore, they utilize volume rendering principles like ray marching to render photorealistic views from this representation [23,24]. ...
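For reference, the ray-marching step composites the sampled densities σ_i and colors c_i along each ray with the standard volume-rendering quadrature used in the NeRF paper, where δ_i is the spacing between adjacent samples:

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right)\mathbf{c}_i,
\qquad
T_i = \exp\!\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big)
```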
... Table 4 presents quantitative performance metrics of our Radiance Fields on real-world test images. Other Radiance Field frameworks [23,34,26] achieve similar results on synthetic and real-world datasets of comparable complexity. Therefore, the quantitative results indicate successful 3D representation learning as well. ...
Preprint
Full-text available
3D detection is a critical task to understand spatial characteristics of the environment and is used in a variety of applications including robotics, augmented reality, and image retrieval. Training performant detection models requires diverse, precisely annotated, and large-scale datasets that involve complex and expensive creation processes. Hence, there are only a few public 3D datasets, which are additionally limited in their range of classes. In this work, we propose a pipeline for automatic generation of 3D datasets for arbitrary objects. By utilizing the universal 3D representation and rendering capabilities of Radiance Fields, our pipeline generates high-quality 3D models for arbitrary objects. These 3D models serve as input for a synthetic dataset generator. Our pipeline is fast, easy to use, and has a high degree of automation. Our experiments demonstrate that 3D pose estimation networks trained with our generated datasets achieve strong performance in typical application scenarios.
... To tackle these challenges, recent works such as LERF (Kerr et al., 2023) have embedded CLIP features within 3D representations like Neural Radiance Fields (NeRF) (Mildenhall et al., 2020). These methods aim to bridge 2D VLMs with 3D scene understanding by enabling open-vocabulary querying across 3D spaces. ...
... Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020) represent 3D geometry and appearance with a continuous implicit radiance field, parameterized by a multilayer perceptron (MLP). They also provide a flexible framework for integrating 2D-based information directly into 3D, supporting complex semantic and spatial tasks. ...
... LERF integrates CLIP embeddings into a 3D NeRF framework, enabling open-vocabulary scene understanding by grounding semantic language features spatially across the 3D field. Unlike standard NeRF outputs (Mildenhall et al., 2020;Barron et al., 2021), LERF introduces a dedicated language field, which leverages multi-scale CLIP embeddings to capture semantic information across varying levels of detail. This language field is represented by F_lang(x, s), where x is the 3D position and s is the scale. ...
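As a rough, heavily simplified sketch of such a position-and-scale-conditioned language field (the feature dimension, widths, and raw-coordinate input are assumptions of this illustration, not LERF's actual implementation, which builds on encoded positions and renders the features with the same weights as color), one could write:

```python
import torch
import torch.nn as nn

class LanguageFieldHead(nn.Module):
    """Simplified position-and-scale-conditioned feature head, F_lang(x, s) -> CLIP-sized vector."""
    def __init__(self, feat_dim: int = 512, width: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1, width), nn.ReLU(),   # input: 3D position x and scale s of shape (..., 1)
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, feat_dim),
        )

    def forward(self, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        feat = self.mlp(torch.cat([x, s], dim=-1))
        # Unit-normalize, since CLIP embeddings are compared with cosine similarity.
        return torch.nn.functional.normalize(feat, dim=-1)
```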
Preprint
Open-vocabulary segmentation, powered by large visual-language models like CLIP, has expanded 2D segmentation capabilities beyond fixed classes predefined by the dataset, enabling zero-shot understanding across diverse scenes. Extending these capabilities to 3D segmentation introduces challenges, as CLIP's image-based embeddings often lack the geometric detail necessary for 3D scene segmentation. Recent methods tend to address this by introducing additional segmentation models or replacing CLIP with variations trained on segmentation data, which leads to redundancy or a loss of CLIP's general language capabilities. To overcome this limitation, we introduce SPNeRF, a NeRF-based zero-shot 3D segmentation approach that leverages geometric priors. We integrate geometric primitives derived from the 3D scene into NeRF training to produce primitive-wise CLIP features, avoiding the ambiguity of point-wise features. Additionally, we propose a primitive-based merging mechanism enhanced with affinity scores. Without relying on additional segmentation models, our method further explores CLIP's capability for 3D segmentation and achieves notable improvements over the original LERF.
... This paper proposes MultiBARF, a method to synthesize pairs of two different sensor images and depth images (2D rendered 3D shape) at assigned viewpoints by only inputting multiview images of two sensors into the model. Neural Radiance Fields (NeRF) [6] is a successful deep learning-based 3D representation for photorealistic scenes. NeRF records 3D models as color and density distributions by a Deep Neural Network (DNN). ...
... Neural Radiance Fields (NeRF). Mildenhall et al. [6] proposed the novel view synthesis method that represents photorealistic continuous scenes by "Neural Radiance Fields" (NeRF). Unlike conventional 3D representations such as voxels or meshes, NeRF expresses 3D scenes or objects by continuous functions using Deep Neural Networks (DNNs). ...
... NeRF and BARF. NeRF [6] is a novel view synthesis method that represents photorealistic scenes by the following steps. First, it queries vectors storing spatial location and viewing direction along camera rays. ...
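The first step, generating the per-pixel query rays from a camera, is commonly implemented as below; the pinhole convention (y down, camera looking along -z) matches many NeRF codebases but is an assumption of this sketch rather than a detail stated in the excerpt.

```python
import torch

def get_rays(H: int, W: int, focal: float, c2w: torch.Tensor):
    """Generate per-pixel ray origins and directions from a pinhole camera.

    `c2w` is a 3x4 camera-to-world matrix (rotation | translation).
    """
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs = torch.stack([(i - W * 0.5) / focal,
                        -(j - H * 0.5) / focal,
                        -torch.ones_like(i)], dim=-1)            # (H, W, 3) in camera space
    rays_d = (dirs[..., None, :] * c2w[:3, :3]).sum(dim=-1)      # rotate into world space
    rays_o = c2w[:3, 3].expand(rays_d.shape)                     # all rays share the camera center
    return rays_o, rays_d
```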
Preprint
Full-text available
Optical sensor applications have become popular through digital transformation. Linking observed data to real-world locations and combining different image sensors is essential to make the applications practical and efficient. However, data preparation to try different sensor combinations requires high sensing and image processing expertise. To make data preparation easier for users unfamiliar with sensing and image processing, we have developed MultiBARF. This method replaces the co-registration and geometric calibration by synthesizing pairs of two different sensor images and depth images at assigned viewpoints. Our method extends Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based novel view synthesis method, for the two imagers. Through experiments on visible light and thermographic images, we demonstrate that our method superimposes two color channels of those sensor images on NeRF.
... A key design decision is the choice of 3D scene representation to establish correspondences and capture structure. Popular representations include multi-plane images [4,13,27,65,71,72,82], neural fields [40,56], voxel grids [41,50,61], and Gaussian splatting [22]. NVS methods can be broadly divided into those that optimize scene representation at test time and those that directly predict it through a feed-forward network [5,12,52,59,67,78]. ...
... Capability comparison table (three criteria per method; column headers not preserved in the excerpt): NeRF [40] ×, ×, ×; InstantNGP [41] ×, ×, ×; 3DGS [22] ×, ×, ×; DeformableGS [77] ×, ✓, ✓*; DeformableNerf [43] ×, ✓, ✓*; PixelSplat [5] ✓, ✓, ×; GS-LRM [79] ✓, ✓, ×; Quark [14] ✓, ✓, ×; Ours ✓, ✓, ✓ ...
... Challenges Optimization-based methods [2,22,40,41,73] are not well-suited for this task, as they require extensive optimization over entire video sequences. Additionally, they perform poorly with sparse views and face challenges in representing long video sequences due to memory and capacity constraints. ...
Preprint
We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal artifacts and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: https://19reborn.github.io/SplatVoxel/
... In recent years, differentiable rendering has emerged as a promising technique to reconstruct 3D content for novel view synthesis [3,27,28,54], surface reconstruction [11,22,25,46], and animation [29,34,48]. In the field of edge reconstruction, several works introduced the idea of differentiable rendering to recover edge information [7,19,53]. ...
... Differentiable Rendering Methods. The advancement of differentiable rendering opens up a new way for 3D reconstruction [22,27,46]. Inspired by NeRF [27], NEF [53] proposes to optimize an implicit neural radiance field with a multi-view rendering loss. After training, 3D edge points are extracted from the radiance field and curves are fitted to these points. ...
Preprint
Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During edge reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image error with respect to the input 2D edge images can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations and apply them along with the sketch optimization. The topological operations help reduce the number of sketches required while ensuring high accuracy, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.
... Radiance field models have seen a surge in popularity in recent years spurred by the development of neural radiance fields (NeRFs), which use a neural network to approximate the 5-dimensional radiance field function [28]. NeRF-based methods are able to leverage existing uncertainty quantification methods for neural networks. (Figure 2: Images containing color, depth, semantics, or other features of interest, along with camera poses and an initial point cloud, are used to train a radiance field model.) ...
... Color: For color, the correlation coefficients for the Blender [28], Mip360 [2], and TUM [41] datasets for all methods are presented in Table 1. As shown in Table 1, our method outperforms the FisherRF and CF-NeRF baselines in terms of correlation and runtime for every dataset and has comparable performance to the 3DGS Ensemble in terms of correlation but requires an order-of-magnitude less time to run. ...
... We conduct experiments on the Blender dataset for the original NeRF [28] and iMAP [42] implementations, beginning with 200 initial rays per image and then re-sampling 1,024 rays at each iteration. All methods train for 200,000 iterations, with quantitative performance recorded every 1,000 iterations until convergence. ...
Preprint
Full-text available
This paper introduces a novel approach to uncertainty quantification for radiance fields by leveraging higher-order moments of the rendering equation. Uncertainty quantification is crucial for downstream tasks including view planning and scene understanding, where safety and robustness are paramount. However, the high dimensionality and complexity of radiance fields pose significant challenges for uncertainty quantification, limiting the use of these uncertainty quantification methods in high-speed decision-making. We demonstrate that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. Our method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for post-processing. Beyond uncertainty quantification, we also illustrate the utility of our approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on synthetic and real-world scenes confirm the efficacy of our approach, which achieves state-of-the-art performance while maintaining simplicity.
... Recently, neural radiance fields (NeRF) [9] have been employed in audio-driven portrait generation [1,[17][18][19][20][21][22][23][24]. NeRF represents the 3D scene as a continuous function with points and viewing directions in 3D space as inputs. ...
... Nevertheless, the absence of 3D structure may result in jitter and artefacts, which fail to produce realistic images and a natural style. In recent years, researchers have conducted extensive research into the use of 3D methods [9][10][11][12][13][14][15][16] for the generation of talking portrait and achieve notable advancements in this field. ...
... In contrast, tremendous progress has been made in 3D-based talking face generation techniques [3,4,18], with the 3D morphable model (3DMM) [27,28] filling in the gaps in the 3D structure to the point of being able to produce more natural-looking videos of talking portraits. Many multi-stage methods [12,20] ... Recently, NeRF [9] has shown potential for critical applications in several fields [1,[17][18][19][20][21][22][23][24]. NeRF overcomes the problem of intermediate feature information loss that may occur during processing of traditional 3DMMs. ...
Article
Full-text available
Neural radiance field (NeRF) has been widely used in the field of talking portrait synthesis. However, the inadequate utilisation of audio information and spatial position leads to the inability to generate images with high audio-lip consistency and realism. This paper proposes a novel tri-plane dynamic neural radiance field (Tri-NeRF) that employs an implicit radiance field to study the impacts of audio on facial movements. Specifically, Tri-NeRF proposes a tri-plane offset network (TPO-Net) to offset spatial positions in three 2D planes guided by audio. This allows for sufficient learning of audio features from image features in a low-dimensional state to generate more accurate lip movements. In order to better preserve facial texture details, we innovatively propose a new gated attention fusion module (GAF) to dynamically fuse features based on the strong and weak correlations of cross-modal features. Extensive experiments have demonstrated that Tri-NeRF can generate talking portraits with audio-lip consistency and realism.
... DL algorithms, such as Convolutional Neural Networks (CNNs) 32 and Generative Adversarial Networks (GANs) 33, can be optimized to learn the underlying structure of the sample, generalizing over different similar samples, and produce high-quality reconstructions from sparse inputs [34][35][36]. Specifically, approaches based on Neural Radiance Fields 37 have recently shown promise in optical and X-ray imaging for reconstructing high-resolution 3D/4D structures from sparse views [38][39][40][41][42][43][44]. Instead of relying on voxels, these methods learn the shape of an object as an implicit function of the 3D spatial coordinates, offering a potential solution to the longstanding memory issues associated with 3D reconstructions. ...
... We designed a self-supervised DL algorithm, 4D-ONIX, to reconstruct temporal and spatial information from XMPI. It combines neural implicit representation 37 and generative adversarial mechanism 33 with the physics of X-ray interaction with matter, resulting in a mapping between the spatial-temporal coordinates and the distribution of the refractive index of the sample. By enforcing consistency between the recorded projections and the estimated projections generated by the model, the model learns by itself the 3D volumetric information of the sample at each measured time point from only the given projections, without needing real 3D information about the sample. ...
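A heavily simplified sketch of that projection-consistency idea is shown below: an implicit (x, y, z, t) network is integrated along one axis to form an estimated projection, which is compared against a recorded projection. The network, the axis-aligned line integral, and the loss are assumptions of this illustration, not 4D-ONIX's actual architecture or its X-ray physics model.

```python
import torch

# Illustrative implicit field (x, y, z, t) -> scalar value related to the refractive index.
field = torch.nn.Sequential(torch.nn.Linear(4, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 1))

def render_projection(t: float, n: int = 64):
    """Integrate the predicted field along z on an n x n x n grid to get one 2D projection."""
    axis = torch.linspace(-1.0, 1.0, n)
    x, y, z = torch.meshgrid(axis, axis, axis, indexing="ij")
    coords = torch.stack([x, y, z, torch.full_like(x, t)], dim=-1).reshape(-1, 4)
    values = field(coords).reshape(n, n, n)
    return values.sum(dim=-1) * (2.0 / n)          # crude line integral along z

recorded = torch.rand(64, 64)                       # stand-in for a measured projection at time t
loss = torch.nn.functional.mse_loss(render_projection(t=0.0), recorded)
loss.backward()                                     # self-supervised consistency signal
```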
Article
Full-text available
The X-ray flux from X-ray free-electron lasers and storage rings enables new spatiotemporal opportunities for studying in-situ and operando dynamics, even with single pulses. X-ray multi-projection imaging is a technique that provides volumetric information using single pulses while avoiding the centrifugal forces induced by conventional time-resolved 3D methods like time-resolved tomography, and can acquire 3D movies (4D) at least three orders of magnitude faster than existing techniques. However, reconstructing 4D information from highly sparse projections remains a challenge for current algorithms. Here we present 4D-ONIX, a deep-learning-based approach that reconstructs 3D movies from an extremely limited number of projections. It combines the computational physical model of X-ray interaction with matter and state-of-the-art deep learning methods. We demonstrate its ability to reconstruct high-quality 4D by generalizing over multiple experiments with only two to three projections per timestamp on simulations of water droplet collisions and experimental data of additive manufacturing. Our results demonstrate 4D-ONIX as an enabling tool for 4D analysis, offering high-quality image reconstruction for fast dynamics three orders of magnitude faster than tomography.
... Seminal work by Ref. 1 demonstrated INR of volume density and color radiance, learned from five dimensions of spatial coordinates and orientation. Their approach uses positional encoding to improve the learning capacity of high-frequency content, addressing the issue of spectral bias in deep learning 8 . ...
... However, using AD with PyTorch, the implicit representation can be differentiated with respect to its coordinate inputs to provide spatial gradients, prior to being queried at the nodes of a quantized grid. That is, the spatial gradient ∇φ(x, y, z) is sampled directly from the neural network model at (x, y, z)_1, ..., (x, y, z)_n. Second and higher order derivatives can be calculated using AD in the same manner, but are not presented here. ...
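A minimal sketch of that use of automatic differentiation, assuming a generic coordinate MLP φ and illustrative query points, is:

```python
import torch

# Minimal sketch: a coordinate MLP phi(x, y, z) -> scalar field value, differentiated with
# respect to its inputs via autograd to obtain spatial gradients at arbitrary query points.
phi = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

coords = torch.rand(1024, 3, requires_grad=True)  # query points (x, y, z)_1 ... (x, y, z)_n
values = phi(coords)                              # predicted field values at those points
grads, = torch.autograd.grad(                     # d(phi)/d(x, y, z), shape (1024, 3)
    outputs=values, inputs=coords,
    grad_outputs=torch.ones_like(values),
    create_graph=False,
)
```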
Article
Full-text available
The recent use of spatial coordinate features in multilayer perceptron (MLP) neural networks provides opportunities for novel applications in potential field geophysics. So-called coordinate MLP networks allow for learning a representative function of potential fields from their surveyed samples. We present a novel method for implicit neural representation of potential fields, demonstrate the quality of the learned implicit function by encoding synthetic and real airborne geophysical survey line data, and compare the result to grid data processed with traditional gridding methods. We further demonstrate the analytical calculation of gradients directly in the continuous domain of the neural network using automatic differentiation, with the same framework used to train the neural network representation. A regular grid created with the proposed method closely matches the ground truth reference synthetic forward model, with a root mean-square error of 10.3 nT, compared to 18.75 nT for minimum curvature. Horizontal gradients calculated with this method are accurate against numerically derived gradients, while the vertical gradient is poor for these case study data. The training process is rapid, and only requires recorded samples from a single survey extent.
... Compared to traditional methods, NeRF (Mildenhall et al. 2020) and subsequent related studies (Wang et al. 2023b; Zhang et al. 2020) have achieved remarkable results in novel view synthesis. These methods exhibit characteristics such as continuity, differentiability, and smoothness. ...
... Among these, DNGaussian incorporates global-to-local depth regularization to mitigate radiance field geometric ambiguity. The study demonstrates strong performance in quantitative evaluations on the publicly available Blender dataset (Mildenhall et al. 2020), outperforming other methods in certain aspects. When compared to SparseNeRF (Wang et al. 2023a) and FreeNeRF (Yang, Pavone, and Wang 2023), DNGaussian achieves a 25-fold increase in training speed and a more than 3000-fold increase in rendering speed on the LLFF dataset (Mildenhall et al. 2019). ...
Article
Full-text available
At present, the widely used traditional three-dimensional (3D) reconstruction techniques are still insufficient to adapt to various diverse scenarios. Compared to traditional methods, emerging technologies like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) for novel view synthesis offer more realistic and comprehensive expression capabilities. However, most of these related technologies still rely on traditional methods and require extensive and dense input views, which poses challenges for reconstruction in real-world scenarios. We propose MFGaussian, a framework based on 3DGS for 3D scene representation by fusing multi-modal data obtained from the Mobile laser scanning system (MLS) to achieve high robustness and accuracy even with limited input views. MFGaussian employs a stepwise training approach to independently learn the global information and details of the scene. During pre-training, a substantial number of virtual training views are generated by projecting color point clouds, thereby enhancing the model's robustness. Subsequently, the model is fine-tuned using the original training views. This method initializes the laser point cloud as 3D Gaussians, obtains camera parameters through multi-sensor calibration and subsequent spherical interpolation, thus obtaining high-precision initial data without relying on Structure from Motion (SfM), and further ensures accurate geometric structure through the partial optimization. Furthermore, an analysis has been conducted on how variations in lighting brightness within the scene affect the view synthesis from diverse perspectives and positions, with an appearance model incorporated to eliminate the resulting color ambiguity. Our method, tested on our dataset and the ETH3D stereo benchmark, demonstrates enhanced capability and robustness of 3DGS in diverse scenarios without SfM or dense view inputs. It outperforms several state-of-the-art methods in both quantitative and qualitative evaluations. Our code will be open-sourced soon after the publication of this manuscript (https://github.com/oucliuyang/MFGaussian).
... Generating realistic 3D scenes from text has garnered increasing attention in AR, gaming, and robotics. Early works [41,92] primarily relied on Neural Radiance Fields (NeRF) [50,54,60] to model 3D scenes, but their computationally intensive volumetric rendering poses limitations for real-time rendering. Recently, 3D Gaussian Splatting (3DGS) [31] has emerged as a promising alternative, enabling real-time rendering while preserving highfidelity details. ...
... However, these approaches typically struggle to achieve photorealistic outcomes. Neural Radiance Fields (NeRF) [50,54] significantly advanced photorealism in 3D generation [9,60], but their computationally intensive volumetric rendering limits real-time applications. Recently, 3D Gaussian Splatting (3DGS) [31] has emerged as a promising alternative, enabling real-time rendering and high visual fidelity. ...
Preprint
Full-text available
We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.
... However, these methods typically require large-scale image and SMPL pose/shape pairs for training, which demands labor-intensive annotation and often compromises generalizability under in-the-wild scenes. On the other hand, implicit representations like Neural Radiance Fields (NeRFs) [5,19] and point-based representations like 3D Gaussian Splatting (3DGS) [20,21] offer high visual-fidelity human reconstruction from monocular videos. However, they struggle with occlusions as they often require pixel-level fine details for subject-specific optimization, which can be largely affected by occlusion noises, as discussed in [12,22]. ...
... However, their framework presents two principal limitations: heavy reliance on precise SMPL mesh estimations, which are inherently challenging to acquire under occluded conditions, and lack of publicly available codes, obstructing both reproducibility and practical adoption. With advances in Neural Radiance Fields (NeRFs) [19] and 3D Gaussian Splatting (3DGS) [20], studies such as [5,21] have adapted these techniques for human reconstruction from monocular videos, though occlusion handling remains problematic, as evidenced in occlusion-robust variants [12,22,45]. Despite this progress, they typically require monocular video inputs and are optimized per subject, limiting their scalability and practical deployment. ...
Preprint
Full-text available
Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging conditions.
... There is ample literature on generating 3D scenes from textual or image prompts. Most such methods are image-based, and progressively reconstruct larger scene regions by expanding from an initial image [8,12,13,17,30,37,42,64,67], combining depth prediction, image and depth outpainting, and 3D reconstruction using NeRF [35] or 3D Gaussian Splatting [19]. The main advantage of these approaches is that they can leverage powerful 2D image generator models to create the first and subsequent views of the scene. ...
... The major challenges for these methods remain semantic drift and object permanence. To obtain an explicit 3D representation, the generated views need to be transferred into such a representation, e.g., NeRF [35] or Gaussians [18,19], where any geometric conflicts would need to be resolved. ...
Preprint
Full-text available
We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.
... Novel view synthesis is a fundamental problem in computer vision due to its widespread applications, such as virtual reality, augmented reality, and robotics. Remarkable progress has been made using neural implicit representations [30,39,40], but these methods suffer from expensive training and rendering [26,12,1,37,60,10,15,31]. Recently, 3D Gaussian Splatting (3DGS) [19] has drawn increasing attention for its explicit Gaussian representation and real-time rendering performance. ...
... Early research focused on capturing dense views to reconstruct scenes, while neural implicit representations have significantly advanced neural processing for 3D data and multi-view images, leading to high reconstruction and rendering quality [29,35,59,40]. In particular, Neural Radiance Fields (NeRF) [30] has garnered considerable attention with a fully connected neural network to represent complex 3D scenes. Subsequent works have emerged to address NeRF's limitations and enhance its performance. ...
Preprint
3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis performance. While conventional methods require per-scene optimization, more recently several feed-forward methods have been proposed to generate pixel-aligned Gaussian representations with a learnable network, which are generalizable to different scenes. However, these methods simply combine pixel-aligned Gaussians from multiple views as scene representations, thereby leading to artifacts and extra memory cost without fully capturing the relations of Gaussians from different images. In this paper, we propose Gaussian Graph Network (GGN) to generate efficient and generalizable Gaussian representations. Specifically, we construct Gaussian Graphs to model the relations of Gaussian groups from different views. To support message passing at Gaussian level, we reformulate the basic graph operations over Gaussian representations, enabling each Gaussian to benefit from its connected Gaussian groups with Gaussian feature fusion. Furthermore, we design a Gaussian pooling layer to aggregate various Gaussian groups for efficient representations. We conduct experiments on the large-scale RealEstate10K and ACID datasets to demonstrate the efficiency and generalization of our method. Compared to the state-of-the-art methods, our model uses fewer Gaussians and achieves better image quality with higher rendering speed.
... However, for larger-scale environments, our understanding tends to remain more coarse and generalized. Previous works, such as NeRF [23] and 3DGS [16], have demonstrated the ... Previous feature splatting works such as F-3DGS [51] and F-Splat [26] directly distill 2D feature maps obtained from foundation models into 3D Gaussians via differentiable rendering. We observe two key issues: First, due to the computational limitations, the feature vector dimensions in Gaussian primitives are significantly reduced compared to the original 2D feature maps (typically 16-64 versus 1024), potentially causing an information bottleneck. ...
... 3D Gaussians and Feature Field. NeRF [23] revolutionized 3D scene representation, but its implicit nature caused slow rendering and training. 3D Gaussian Splatting [16] (3DGS) emerged as a faster, more explicit alternative, enabling rapid training and real-time rendering. ...
Preprint
We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.
... More recently, methods exploiting Neural Radiance Fields (NeRF) [20] have demonstrated significant potential to capture the geometric structure of scenes for 3D object detection [12,13,42]. For instance, NeRF-Det [42] predicts scene opacity through NeRF, which allows dynamic adjustment of voxel features. ...
... (1) Deficiency of 3D positional information: 2D image features lack the necessary spatial information for precise 3D object localization. (2) Insufficient scene geometry perception: NeRF [20] focuses on scene-level rendering, neglecting object-level details crucial for object detection. This results in imprecise opacity predictions, which leads to poor 3D object detection. ...
Preprint
We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Code will be available at https://github.com/ZechuanLi/GO-N3RDet.
... Triangle meshes are a fundamental representation for 3D assets and are widely used across various industries, including virtual reality, gaming, and animation. These meshes can be either manually created by artists or automatically generated by applying Marching Cubes [33] to volumetric fields, such as Neural Radiance Fields (NeRF) [36] or Signed Distance Fields (SDF) [40]. Artist-crafted meshes typically exhibit well-optimized topology, which facilitates editing, deformation, and texture mapping. ...
... To minimize generation time, some approaches [9, 17, 27-29, 32, 46, 55, 65, 68, 73, 83] predict multi-view images and use reconstruction algorithms to produce 3D models. The Large Reconstruction Model (LRM) [15] proposes a transformer-based reconstruction model to predict a NeRF representation [36] from a single image within seconds. Subsequent research [21,49,52,58,64,71,72,79,80,87,88] further improves LRM's generation quality by incorporating multi-view images or other 3D representations [19]. ...
Preprint
Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: https://zhaorw02.github.io/DeepMesh/
... Novel view synthesis is a widely discussed topic with various applications in video games, telepresence, sports broadcasting, and the metaverse. The past few years have witnessed remarkable progress in this domain, primarily due to the emergence of neural radiance fields (NeRF) [50] and 3D Gaussian Splatting (3DGS) [31]. However, human novel view synthesis still faces many challenges. ...
... Early methods [61,77,92] focus primarily on leveraging pixel-aligned image features to regress the signed distance field (SDF) under the supervision of ground-truth 3D models. Later, the emergence of neural radiance fields (NeRF) [50] opened up a new paradigm for rendering high-fidelity humans based only on sparse multi-view images [58,63,89] or even a monocular image/video [21,24,27,71]. Although NeRF-based methods have demonstrated impressive results in animating [38,39,45,69] and editing [10,12,13,76,79,81] human avatars, these methods typically fail to generalize to unseen subjects because they require per-subject optimization, and their rendering efficiency is low due to the high computation needed to render each pixel. ...
Preprint
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen human from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with few overlappings and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at https://github.com/iSEE-Laboratory/RoGSplat.
... The emergence of Neural Radiance Fields (NeRF) [32] and 3D Gaussian Splatting (3DGS) [16] has significantly advanced 3D scene reconstruction, with widespread applications in 3D editing [4,11,26,44,46,55] and SLAM [15,31,36,41,64]. Recent research has extended NeRF and 3DGS to dynamic scenes [1,9,27,53,54,63], demonstrating promising results. ...
Preprint
3D Gaussian Splatting (3DGS) has shown remarkable potential for static scene reconstruction, and recent advancements have extended its application to dynamic scenes. However, the quality of reconstructions depends heavily on high-quality input images and precise camera poses, which are not that trivial to fulfill in real-world scenarios. Capturing dynamic scenes with handheld monocular cameras, for instance, typically involves simultaneous movement of both the camera and objects within a single exposure. This combined motion frequently results in image blur that existing methods cannot adequately handle. To address these challenges, we introduce BARD-GS, a novel approach for robust dynamic scene reconstruction that effectively handles blurry inputs and imprecise camera poses. Our method comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. By explicitly decomposing motion blur into camera motion blur and object motion blur and modeling them separately, we achieve significantly improved rendering results in dynamic regions. In addition, we collect a real-world motion blur dataset of dynamic scenes to evaluate our approach. Extensive experiments demonstrate that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods.
... EulerFlow is expensive to optimize. With our implementation, optimizing EulerFlow for a single Argoverse 2 sequence takes 24 hours on one NVIDIA V100 16GB GPU, putting it on par with the original NeRF paper's computation expense (Mildenhall et al., 2021). However, like with NeRF, we believe algorithmic, optimization, and engineering improvements can significantly reduce runtime. ...
Preprint
Full-text available
Scene flow estimation is the task of describing 3D motion between temporally successive observations. This thesis aims to build the foundation for building scene flow estimators with two important properties: they are scalable, i.e. they improve with access to more data and computation, and they are flexible, i.e. they work out-of-the-box in a variety of domains and on a variety of motion patterns without requiring significant hyperparameter tuning. In this dissertation we present several concrete contributions towards this. In Chapter 1 we contextualize scene flow and its prior methods. In Chapter 2 we present a blueprint to build and scale feedforward scene flow estimators without requiring expensive human annotations via large scale distillation from pseudolabels provided by strong unsupervised test-time optimization methods. In Chapter 3 we introduce a benchmark to better measure estimate quality across diverse object types, better bringing into focus what we care about and expect from scene flow estimators, and use this benchmark to host a public challenge that produced significant progress. In Chapter 4 we present a state-of-the-art unsupervised scene flow estimator that introduces a new, full sequence problem formulation and exhibits great promise in adjacent domains like 3D point tracking. Finally, in Chapter 5 I philosophize about what's next for scene flow and its potential future broader impacts.
... They use surfels to represent the scene and reconstruct it with estimated laparoscope poses. Neural Radiance Fields (NeRF) (Mildenhall et al., 2021) based methods have developed quickly in recent years owing to their impressive performance in novel scene synthesis. NeRF uses Multi-Layer Perceptrons (MLPs) to implicitly represent the volume density and color at any 3D location. ...
Preprint
Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to suboptimal results. However, the visual features from these models can still enhance endoscopic tasks, emphasizing the need for efficient adaptation strategies, which still lack exploration currently. In this paper, we introduce Endo3DAC, a unified framework for endoscopic scene reconstruction that efficiently adapts foundation models. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. By freezing the backbone foundation model and training only the specially designed Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) with separate decoder heads, Endo3DAC achieves superior depth and pose estimation while maintaining training efficiency. Additionally, we propose a 3D scene reconstruction pipeline that optimizes depth maps' scales, shifts, and a few parameters based on our integrated network. Extensive experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. To our knowledge, we are the first to utilize a single network that only requires surgical videos to perform both SSL depth estimation and scene reconstruction tasks. The code will be released upon acceptance.
... 3D generation with Gaussian splatting. Compared with 2D images, 3D objects are more difficult to generate due to the additional dimension and geometric constraints [15,28]. Among all 3D generation methods [4,16,18,34,47,48,51], there is a line of work that utilizes diffusion models to generate 3D Gaussians, which is closely related to this paper. ...
Preprint
Full-text available
Recent advances in text-to-image diffusion models have been driven by the increasing availability of paired 2D data. However, the development of 3D diffusion models has been hindered by the scarcity of high-quality 3D data, resulting in less competitive performance compared to their 2D counterparts. To address this challenge, we propose repurposing pre-trained 2D diffusion models for 3D object generation. We introduce Gaussian Atlas, a novel representation that utilizes dense 2D grids, enabling the fine-tuning of 2D diffusion models to generate 3D Gaussians. Our approach demonstrates successful transfer learning from a pre-trained 2D diffusion model to a 2D manifold flattened from 3D structures. To support model training, we compile GaussianVerse, a large-scale dataset comprising 205K high-quality 3D Gaussian fittings of various 3D objects. Our experimental results show that text-to-image diffusion models can be effectively adapted for 3D content generation, bridging the gap between 2D and 3D modeling.
... Some approaches [32,53,77] incorporate neural rendering techniques to enhance face rendering quality. However, due to the inherent limitations of mesh representations, it is still hard to capture detailed facial motions and to model non-facial regions. Recent advancements have leveraged NeRF [47] to achieve photo-realistic head rendering. Some methods integrate parametric facial priors with MLPs [3,28,102], tri-planes [10,39,41,44,58,71,84,87], manifolds [72] or volumetric primitives [6] to enhance controllability and disentanglement within radiance fields. ...
Preprint
Recent advances in diffusion models have made significant progress in digital human generation. However, most existing models still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. A key reason for these shortcomings is the limited representation ability of commonly used control signals (e.g., landmarks and depth maps). In addition, the lack of diversity in identity and pose variations in public datasets further hinders progress in this area. In this paper, we analyze the shortcomings of current control signals and introduce a novel control signal representation that is optimizable, dense, expressive, and 3D consistent. Our method embeds a learnable neural Gaussian onto a parametric head surface, which greatly enhances the consistency and expressiveness of diffusion-based head models. Regarding the dataset, we synthesize a large-scale dataset with multiple poses and identities. In addition, we use real/synthetic labels to effectively distinguish real and synthetic data, minimizing the impact of imperfections in synthetic data on the generated head images. Extensive experiments show that our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency. Our code, synthetic datasets, and pre-trained models will be released on our project page: https://ustc3dv.github.io/Learn2Control/
... We present the first method capable of generating consistent 360-degree views from a single portrait of a head, accommodating human, stylized, or animal anthropomorphic forms, as well as accessories like glasses, jewelry, or masks. Our novel views can be computed offline (5.6 sec per view) and transformed into a high-quality neural radiance field (NeRF) [33], enabling free-viewpoint rendering in real time. Our method is particularly robust, capable of handling a wide range of subjects, complex hairstyles, varied head poses, and expressive facial features, including detailed elements like tongues. ...
Preprint
Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.
... Diffusion models [9] have demonstrated their strong capability in image generation and video synthesis [15], paving the way for diffusion-based 3D content creation. Various studies [10,32,33] have been dedicated to distilling consistent 3D representations, such as neural radiance fields (NeRF) [30] or 3D Gaussian splatting [12], from 2D image diffusion models or vision language models using the Score Distillation Sampling (SDS) loss [32]. Although these methods yield visually pleasing results, the distillation process tends to be time-intensive for generating a single shape, requires intricate parameter tuning to obtain satisfactory quality, and often faces issues with unstable convergence and quality degradation. ...
Preprint
Full-text available
We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of the score function in states of random noise. To this end, we propose edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on those distilled diffusion models, we propose an adversarial augmentation strategy to further enrich the generation detail and boost overall generation quality. The two modules complement each other, mutually reinforcing to elevate generative performance. Extensive experiments demonstrate that our Acc3D not only achieves over a 20× increase in computational efficiency but also yields notable quality improvements, compared to the state of the art.
... In contrast to these methods focusing on sparse matches, our work finds dense maps of contextually corresponding regions. ... Neural fields are spatio-temporal quantities that are parameterized fully or partially by a neural network [121]. Prominent applications of neural fields include photorealistic 3D reconstruction [15,36,70], 3D geometry extraction [113,126], and SLAM [111,112,132]. While these studies primarily focus on visual fidelity and geometric accuracy, more recent works apply neural fields to semantic scene understanding [28,42,130] and robot motion planning [97-99, 114, 115]. ...
Preprint
Full-text available
Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As it is intractable for data-driven learning to comprehensively encapsulate the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications.
... In particular, we consider the generative NVS (GNVS) task conditioned on a single view. Standard "interpolative" NVS settings (e.g., NeRFs [25] and 3DGS [18]) operate with many images, resolving scene geometry (up to a global scale) and allowing NVS for that scene in its arbitrary coordinate system. In contrast, a GNVS model faces an under-determined scenario: given a single observed image and a trajectory of camera parameters, it must generate novel views for that camera path. ...
Preprint
Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.
... Recent advancements in 3D Gaussian Splatting (3DGS) [13] have enabled efficient, high-quality rendering by modeling deformable targets with continuous and differentiable Gaussian primitives. Unlike NeRF-based [18] methods [4,6,7], which suffer from slow inference due to dense ray sampling and multilayer perceptron (MLP) computations, 3DGS offers real-time rendering capability while preserving high visual quality. Leveraging these advantages, 3DGS has been extended to articulated human bodies and hands [8,11,22,24,27,30], where representations are initialized in a canonical space and deformed into the posed space through rigid skeletal motion. ...
Preprint
Existing 3D Gaussian Splatting (3DGS) methods for hand rendering rely on rigid skeletal motion with an oversimplified non-rigid motion model, which fails to capture fine geometric and appearance details. Additionally, they perform densification based solely on per-point gradients and process poses independently, ignoring spatial and temporal correlations. These limitations lead to geometric detail loss, temporal instability, and inefficient point distribution. To address these issues, we propose HandSplat, a novel Gaussian Splatting-based framework that enhances both fidelity and stability for hand rendering. To improve fidelity, we extend standard 3DGS attributes with implicit geometry and appearance embeddings for finer non-rigid motion modeling while preserving the static hand characteristic modeled by original 3DGS attributes. Additionally, we introduce a local gradient-aware densification strategy that dynamically refines Gaussian density in high-variation regions. To improve stability, we incorporate pose-conditioned attribute regularization to encourage attribute consistency across similar poses, mitigating temporal artifacts. Extensive experiments on InterHand2.6M demonstrate that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. We will release the code and pre-trained models upon acceptance.
... This method interprets 2-D images as projections of a 3-D volume, allowing more accurate and isotropic reconstruction of high-frequency structures. Using a neural network to map spatial coordinates to density values [21], CryoNeFEN can resolve finer details and dynamic states of macromolecules with superior resolution. In general, spatial domain methods [20,22] have shown significant improvements in reconstructing 3-D structures with high resolution, demonstrating the potential of integrating advanced computational techniques with Cryo-EM data. ...
Preprint
Full-text available
In the field of structural biology, Cryo-EM based high-resolution 3-D structure reconstruction of complex macromolecules is a vital step. Although multiple attempts have been tried within this framework to consider quality-degrading factors such as imaging noise, non-uniform distribution of particle orientations, and sample heterogeneity in order to achieve high resolution, there is still a substantial gap between the best reconstruction resolution achieved by the existing methods and the hard resolution provided by the imaging device. Here, we introduce CryoGS, a novel 3-D reconstruction method for Cryo-EM structures using Gaussian splatting. Through the integration of 3-D Gaussian representations into neural network learning, CryoGS employs a spatial domain approach to optimize learnable 3-D Gaussians and project them into 2-D images using the splatting technique. Compared with the existing methods, CryoGS achieves significant improvements in resolution, isotropy, and computational efficiency. For example, CryoGS achieves a resolution of 2.217Å on EMPIAR-10492 dataset, approaching its theoretical limit of 2.2Å, while the best resolution achieved by the existing methods is 3.805Å. Furthermore, CryoGS exhibits remarkable robustness in reconstructing heterogeneous structures and high-resolution models under extreme conditions such as pose inaccuracy, limited particle data, and high noise. Based on these results, we believe that CryoGS has great potential to be a powerful tool for Cryo-EM applications to ensure enhanced resolution, robustness, and efficiency.
... Neural rendering synthesizes novel views or edited scenes from volumetric or surface-based data, bridging computer graphics and machine learning. NeRFs [Mildenhall et al. 2021] first demonstrated high-fidelity view synthesis by mapping continuous 3D coordinates to density and color, and were later extended to dynamic scenes [Park et al. 2021] and to anti-aliased rendering of unbounded environments [Barron et al. 2022]. Surface-based methods [Lombardi et al. 2019; Tretschk et al. 2020] utilize explicit geometry or point-based representations for realistic rendering but often demand substantial data and computation, complicating fine-grained edits. ...
Preprint
Despite recent advances in text-to-image generation, controlling geometric layout and material properties in synthesized scenes remains challenging. We present a novel pipeline that first produces a G-buffer (albedo, normals, depth, roughness, and metallic) from a text prompt and then renders a final image through a modular neural network. This intermediate representation enables fine-grained editing: users can copy and paste within specific G-buffer channels to insert or reposition objects, or apply masks to the irradiance channel to adjust lighting locally. As a result, real objects can be seamlessly integrated into virtual scenes, and virtual objects can be placed into real environments with high fidelity. By separating scene decomposition from image rendering, our method offers a practical balance between detailed post-generation control and efficient text-driven synthesis. We demonstrate its effectiveness on a variety of examples, showing that G-buffer editing significantly extends the flexibility of text-guided image generation.
... Many methods exist for 3D reconstruction, including Structure from Motion (SfM) (Ullman 1979), Space Carving (Kutulakos and Seitz 1999), and Neural Radiance Fields (NeRF) (Mildenhall et al. 2020). Unfortunately, SfM and NeRF prove difficult to use because of the complex and sparse nature of root and shoot morphology. ...
Article
Full-text available
Accurate and nondestructive estimation of plant biomass is crucial for optimizing plant productivity, but existing methods are often expensive and require complex experimental setups. To address this challenge, we developed an automated system for estimating plant root and shoot biomass over the plant's lifecycle in hydroponic systems. This system employs a robotic arm and turntable to capture 40 images at equidistant angles around a hydroponically grown lettuce plant. These images are then processed into silhouettes and used in voxel‐based volumetric 3D reconstruction to produce detailed 3D models. We utilize a space carving method along with a raytracing‐based optical correction technique to create high‐accuracy reconstructions. Analysis of these models demonstrates that our system accurately reconstructs the plant root structure and provides precise measurements of root volume, which can be calibrated to indicate biomass.
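For readers unfamiliar with silhouette-based volumetric reconstruction, the NumPy sketch below shows the basic carving step the abstract alludes to: a voxel survives only if it projects inside the plant silhouette in every view. The plain 3x4 projection model and all function names are illustrative assumptions; the paper's raytracing-based optical correction is not modeled.

```python
# Minimal sketch of silhouette-based voxel carving (visual hull).
import numpy as np


def carve_visual_hull(voxels, projections, silhouettes):
    """voxels: (N, 3) world points; projections: list of 3x4 camera matrices;
    silhouettes: list of boolean (H, W) masks, one per view."""
    keep = np.ones(len(voxels), dtype=bool)
    homog = np.concatenate([voxels, np.ones((len(voxels), 1))], axis=1)  # (N, 4)
    for P, mask in zip(projections, silhouettes):
        uvw = homog @ P.T                          # project voxels to image plane
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        h, w = mask.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]   # lands inside the silhouette?
        keep &= hit                                # carve away voxels that miss
    return voxels[keep]
```

Summing the retained voxels and multiplying by the voxel volume then yields the kind of volume estimate that the paper calibrates against biomass.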
... Recently, neural radiance fields (NeRFs) [45] have reinvigorated the field of novel-view synthesis due to their high-quality results and conceptual simplicity. Mip-NeRF 360 [3] showed the first results on 360° scenes, although the camera is always inward-facing. ...
Article
Full-text available
360° images are a popular medium for bringing photography into virtual reality. While users can look in any direction by rotating their heads, 360° images ultimately look flat. That is because they lack depth information and thus cannot create motion parallax when translating the head. To achieve a fully immersive VR experience from a single 360° image, we introduce a novel method to upgrade 360° images to free-viewpoint renderings with 6 degrees of freedom. Alternative approaches reconstruct textured 3D geometry, which is fast to render but suffers from visible reconstruction artifacts, or use neural radiance fields that produce high-quality novel views but too slowly for VR applications. Our 360° 3D photos build on 3D Gaussian splatting as the underlying scene representation to simultaneously achieve high visual quality and real-time rendering speed. To fill plausible content in previously unseen regions, we introduce a novel combination of latent diffusion inpainting and monocular depth estimation with Poisson-based blending. Our results demonstrate state-of-the-art visual and depth quality at rendering rates of 105 FPS per megapixel on a commodity GPU.
... Radiance fields have emerged as a promising representation for reconstructing 3D scenes with various properties, e.g., geometries, colors, and semantics, from only 2D inputs such as RGB images and segmentation masks. Neural Radiance Field (NeRF) [25] models the radiance field using a neural network composed of layers of multilayer perceptrons. Since then, various works have attempted to improve the efficiency of NeRF, e.g., by explicitly formulating the field using 3D structures such as voxels [2,22] and hash grids [27]. ...
Preprint
Full-text available
Lifting multi-view 2D instance segmentation to a radiance field has proven effective in enhancing 3D understanding. Existing methods rely on direct matching for end-to-end lifting, yielding inferior results, or employ a two-stage solution constrained by complex pre- or post-processing. In this work, we design a new end-to-end object-aware lifting approach, named Unified-Lift, which provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. While promising, achieving effective codebook learning is non-trivial and a naive solution leads to degraded performance. Therefore, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: the LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results manifest that our Unified-Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at https://github.com/Runsong123/Unified-Lift.
Preprint
Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully connected MLP architecture of current INRs, which lacks mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data, which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at https://github.com/JLuij/anchored_representation_clouds.
Article
Emphasizing self-improvement, simulation-driven refinement, and reduced human oversight in the development of autonomous machines.
Article
Full-text available
Simultaneous Localization and Mapping (SLAM) is a crucial technology for intelligent unmanned systems to estimate their motion and reconstruct unknown environments. However, SLAM systems that rely on a single sensor have poor robustness and stability due to the limitations of that sensor. Recent studies have demonstrated that SLAM systems with multiple sensors, mainly consisting of LiDAR, camera, and IMU, achieve better performance due to the mutual compensation of different sensors. This paper investigates recent progress on multi-sensor fusion SLAM. The review includes a systematic analysis of the advantages and disadvantages of different sensors and the imperative of multi-sensor solutions. It categorizes multi-sensor fusion SLAM systems into four main types by the fused sensors: LiDAR-IMU SLAM, Visual-IMU SLAM, LiDAR-Visual SLAM, and LiDAR-IMU-Visual SLAM, with detailed analysis and discussions of their pipelines and principles. Meanwhile, the paper surveys commonly used datasets and introduces evaluation metrics. Finally, it concludes with a summary of the existing challenges and future opportunities for multi-sensor fusion SLAM.
Article
The paper presents an efficient light field image synthesis method based on single-viewpoint images, which can directly generate high-quality light field images from a single-viewpoint input image. The proposed method integrates light field image encoding with the tiled rendering technique of 3DGS. In the construction of the rendering pipeline, a viewpoint constraint strategy is adopted to optimize rendering quality, and a sub-pixel rendering strategy is implemented to improve rendering efficiency. Experimental results demonstrate that 8K light field images with 96 viewpoints can be generated end to end in real time. The research presented in the paper provides a new approach for the real-time generation of high-resolution light field images, advancing the application of light field display technology in low-cost environments.
Conference Paper
Full-text available
Incremental Structure-from-Motion is a prevalent strategy for 3D reconstruction from unordered image collections. While incremental reconstruction systems have tremendously advanced in all regards, robustness, accuracy, completeness, and scalability remain the key problems towards building a truly general-purpose pipeline. We propose a new SfM technique that improves upon the state of the art to make a further step towards this ultimate goal. The full reconstruction pipeline is released to the public as an open-source implementation.
Conference Paper
Full-text available
Inverse graphics attempts to take sensor data and infer 3D geometry, illumination, materials, and motions such that a graphics renderer could realistically reproduce the observed scene. Renderers, however, are designed to solve the forward process of image synthesis. To go in the other direction, we propose an approximate differentiable renderer (DR) that explicitly models the relationship between changes in model parameters and image observations. We describe a publicly available OpenDR framework that makes it easy to express a forward graphics model and then automatically obtain derivatives with respect to the model parameters and to optimize over them. Built on a new autodifferentiation package and OpenGL, OpenDR provides a local optimization method that can be incorporated into probabilistic programming frameworks. We demonstrate the power and simplicity of programming with OpenDR by using it to solve the problem of estimating human body shape from Kinect depth and RGB data.
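To make the inverse-graphics loop concrete, here is a toy PyTorch sketch in the same spirit: a differentiable forward "renderer" lets image-space error drive gradient updates of scene parameters. The Gaussian-blob renderer and all names are stand-ins under my own assumptions and do not use OpenDR's actual API.

```python
# Toy differentiable-rendering loop: fit a blob position to an observed image.
import torch


def render_blob(center: torch.Tensor, size: int = 64, sigma: float = 4.0) -> torch.Tensor:
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))       # differentiable w.r.t. center


target = render_blob(torch.tensor([40.0, 22.0]))     # "observed" image
center = torch.tensor([10.0, 10.0], requires_grad=True)
opt = torch.optim.Adam([center], lr=1.0)
for _ in range(200):
    opt.zero_grad()
    loss = ((render_blob(center) - target) ** 2).mean()
    loss.backward()                                   # d(image error) / d(parameters)
    opt.step()
```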
Article
Full-text available
We describe an image-based rendering approach that generalizes many current image-based rendering algorithms, including light field rendering and view-dependent texture mapping. In particular, it allows for lumigraph-style rendering from a set of input cameras in arbitrary configurations (i.e., not restricted to a plane or to any specific manifold). In the case of regular and planar input camera positions, our algorithm reduces to a typical lumigraph approach. When presented with fewer cameras and good approximate geometry, our algorithm behaves like view-dependent texture mapping. The algorithm achieves this flexibility because it is designed to meet a set of specific goals that we describe. We demonstrate this flexibility with a variety of examples.
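A simplified NumPy sketch of the camera-blending idea, under the assumption that angular deviation is the only penalty; the full method also accounts for resolution and field-of-view differences, and all function and variable names here are illustrative.

```python
# Simplified unstructured-lumigraph-style blending weights for one surface point.
import numpy as np


def blending_weights(point, desired_cam, source_cams, k: int = 4):
    """point: (3,) surface point; desired_cam: (3,) novel camera center;
    source_cams: (N, 3) input camera centers."""
    d_des = (desired_cam - point) / np.linalg.norm(desired_cam - point)
    d_src = source_cams - point
    d_src /= np.linalg.norm(d_src, axis=1, keepdims=True)
    penalty = np.arccos(np.clip(d_src @ d_des, -1.0, 1.0))    # angular deviation
    order = np.argsort(penalty)
    thresh = penalty[order[min(k, len(penalty) - 1)]]          # k-th best as cutoff
    w = np.clip(1.0 - penalty / max(thresh, 1e-8), 0.0, None)  # feather to zero
    return w / max(w.sum(), 1e-8)                              # normalized blend weights
```

With dense input cameras the weights concentrate on the closest views, while with few cameras and good proxy geometry the same weighting degrades gracefully toward view-dependent texture mapping, mirroring the behavior the abstract describes.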
Chapter
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
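One common discretization of the volume-rendering step described above, sketched in NumPy; the function name and the assumption of precomputed per-sample densities, colors, and spacings along a ray are illustrative.

```python
# Composite samples along one ray into a pixel color via alpha compositing.
import numpy as np


def composite_ray(sigmas, colors, deltas):
    """sigmas: (S,) densities, colors: (S, 3) RGB, deltas: (S,) sample spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # pixel RGB
```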
Conference Paper
This paper presents new algorithms to trace objects represented by densities within a volume grid, e.g. clouds, fog, flames, dust, particle systems. We develop the light scattering equations, discuss previous methods of solution, and present a new approximate solution to the full three-dimensional radiative scattering problem suitable for use in computer graphics. Additionally we review dynamical models for clouds used to make an animated movie.
Article
This tutorial survey paper reviews several different models for light interaction with volume densities of absorbing, glowing, reflecting, and/or scattering material. They are, in order of increasing realism, absorption only, emission only, emission and absorption combined, single scattering of external illumination without shadows, single scattering with shadows, and multiple scattering. For each model the paper provides the physical assumptions, describes the applications for which it is appropriate, derives the differential or integral equations for light transport, presents calculation methods for solving them, and shows output images for a data set representing a cloud. Special attention is given to calculation methods for the multiple scattering model.
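In the combined emission-absorption case, the light transport along a ray reduces to the standard integral below, written here with the generic symbols commonly used for radiance fields rather than the survey's own notation.

```latex
% Color accumulated along a ray r(t): emitted radiance c, weighted by the
% density sigma and attenuated by the transmittance T back to the eye.
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,
                \mathbf{c}\big(\mathbf{r}(t)\big)\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma\big(\mathbf{r}(s)\big)\,ds\right)
```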
Article
A number of techniques have been developed for reconstructing surfaces by integrating groups of aligned range images. A desirable set of properties for such algorithms includes: incremental updating, representation of directional uncertainty, the ability to fill gaps in the reconstruction, and robustness in the presence of outliers. Prior algorithms possess subsets of these properties. In this paper, we present a volumetric method for integrating range images that possesses all of these properties.
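The incremental-updating property can be illustrated with a small NumPy sketch of a truncated signed-distance update per voxel; the projection of voxels into each range image and the original method's directional weighting are simplified away, and all names are illustrative.

```python
# Fold one range image's observations into a running per-voxel weighted average.
import numpy as np


def integrate(tsdf, weight, new_dist, new_weight, trunc: float = 0.05):
    """tsdf, weight: (N,) running voxel state; new_dist: (N,) signed distance of
    each voxel to the surface seen in the current range image; new_weight: (N,)."""
    d = np.clip(new_dist, -trunc, trunc)                  # truncate far distances
    valid = new_weight > 0                                # voxels observed this frame
    total = weight[valid] + new_weight[valid]
    tsdf[valid] = (tsdf[valid] * weight[valid] + d[valid] * new_weight[valid]) / total
    weight[valid] = total
    return tsdf, weight
```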
Article
A surface light field is a function that assigns a color to each ray originating on a surface. Surface light fields are well suited to constructing virtual images of shiny objects under complex lighting conditions. This paper presents a framework for construction, compression, interactive rendering, and rudimentary editing of surface light fields of real objects. Generalizations of vector quantization and principal component analysis are used to construct a compressed representation of an object's surface light field from photographs and range scans. A new rendering algorithm achieves interactive rendering of images from the compressed representation, incorporating view-dependent geometric level-of-detail control. The surface light field representation can also be directly edited to yield plausible surface light fields for small changes in surface geometry and reflectance properties.
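As a rough illustration of the compression idea, the sketch below applies a plain truncated PCA to the matrix of colors a set of surface points emits over sampled view directions; the cited work uses more elaborate, surface-adapted generalizations, and the function here is purely illustrative.

```python
# Low-rank (PCA-style) compression of per-point view-dependent colors.
import numpy as np


def compress_light_field(samples, rank: int = 4):
    """samples: (num_points, num_directions) array for one color channel."""
    mean = samples.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(samples - mean, full_matrices=False)
    coeffs = u[:, :rank] * s[:rank]          # per-point coefficients
    basis = vt[:rank]                        # shared directional basis
    return mean, coeffs, basis               # reconstruct: mean + coeffs @ basis
```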
Article
We present a new approach for modeling and rendering existing architectural scenes from a sparse set of still photographs. Our modeling approach, which combines both geometry-based and image-based techniques, has two components. The first component is a photogrammetric modeling method which facilitates the recovery of the basic geometry of the photographed scene. Our photogrammetric modeling approach is effective, convenient, and robust because it exploits the constraints that are characteristic of architectural scenes. The second component is a model-based stereo algorithm, which recovers how the real scene deviates from the basic model. By making use of the model, our stereo technique robustly recovers accurate depth from widely-spaced image pairs. Consequently, our approach can model large architectural environments with far fewer photographs than current image-based modeling approaches. For producing renderings, we present view-dependent texture mapping, a method of compositing multiple views of a scene that better simulates geometric detail on basic models. Our approach can be used to recover models for use in either geometry-based or image-based rendering systems. We present results that demonstrate our approach's ability to create realistic renderings of architectural scenes from viewpoints far from the original photographs.